UQA: Corpus for Urdu Question Answering
International Conference on Computational Linguistics (2024)
Abstract
This paper introduces UQA, a novel dataset for question answering and text
comprehension in Urdu, a low-resource language with over 70 million native
speakers. UQA is generated by translating the Stanford Question Answering
Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called
EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in
the translated context paragraphs. The paper describes the process of selecting
and evaluating the best translation model among two candidates: Google
Translator and Seamless M4T. The paper also benchmarks several state-of-the-art
multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and
reports promising results: XLM-RoBERTa-XL achieves an F1 score of 85.99 and an
exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual
NLP systems for Urdu and for enhancing the cross-lingual transferability of
existing models. Further, the paper demonstrates the effectiveness of EATS for
creating high-quality datasets for other languages and domains. The UQA dataset
and the code are publicly available at www.github.com/sameearif/UQA.
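
The abstract names the three EATS steps (Enclose to Anchor, Translate, Seek) but gives no implementation detail, so the following is only a minimal sketch of how such a pipeline might look, not the authors' code. The marker strings `OPEN` and `CLOSE`, the function names, and the `translate` callable are all hypothetical; in practice `translate` would wrap a real machine translation system, and the paper may use different delimiters.

```python
from typing import Callable, Optional, Tuple

# Hypothetical anchor markers; the delimiters used in the paper are not
# stated in the abstract.
OPEN, CLOSE = "[[", "]]"


def enclose(context: str, answer_start: int, answer_text: str) -> str:
    """Enclose: wrap the answer span in anchor markers before translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def seek(translated: str) -> Optional[Tuple[str, int, str]]:
    """Seek: locate the markers in the translated text.

    Returns (clean_context, answer_start, answer_text), or None if the
    translation step dropped or mangled the markers.
    """
    i = translated.find(OPEN)
    if i == -1:
        return None
    j = translated.find(CLOSE, i + len(OPEN))
    if j == -1:
        return None
    answer = translated[i + len(OPEN):j]
    # Remove the markers so the stored context is plain text.
    clean = translated[:i] + answer + translated[j + len(CLOSE):]
    return clean, i, answer


def eats(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],
) -> Optional[Tuple[str, int, str]]:
    """Run Enclose, Translate, Seek for a single SQuAD-style example."""
    return seek(translate(enclose(context, answer_start, answer_text)))
```

In this sketch, an example whose markers do not survive translation yields `None` and would have to be discarded or handled separately; how the paper treats such cases is not described in the abstract.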