ParaShoot: A Hebrew Question Answering Dataset

NLP research in Hebrew has largely focused on morphology and syntax, where rich annotated datasets in the spirit of Universal Dependencies are available. Semantic datasets, however, are in short supply, hindering crucial advances in the development of NLP technology in Hebrew. In this work, we present ParaShoot, the first question-answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3,000 annotated examples, comparable in size to other question-answering datasets in low-resource languages. We provide the first baseline results using recently released BERT-style models for Hebrew, showing that there is significant room for improvement on this task.
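Because ParaShoot follows the SQuAD format, each example pairs a context paragraph with a question and a character-indexed answer span, and systems are typically scored with SQuAD-style exact match and token-level F1. Below is a minimal sketch of that record layout and of the token-overlap F1 computation; the Hebrew record shown is illustrative rather than an actual ParaShoot example, and the sketch omits the official SQuAD script's answer normalization.

```python
from collections import Counter

# A SQuAD-format record: a context paragraph, a question, and the gold
# answer given as text plus its character offset into the context.
# (Illustrative example; not taken from the ParaShoot dataset.)
example = {
    "context": "תל אביב נוסדה בשנת 1909 על חולות מצפון ליפו.",
    "question": "באיזו שנה נוסדה תל אביב?",
    "answers": {"text": ["1909"], "answer_start": [19]},
}

def f1_score(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer.

    Skips the official evaluation script's normalization (lowercasing,
    punctuation and article stripping) for brevity.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Exact span match yields F1 = 1.0; partial overlap yields a fractional score.
print(f1_score("1909", example["answers"]["text"][0]))
```

In the SQuAD methodology, a prediction is compared against all gold answer variants and the maximum F1 is taken, so the per-example score rewards any acceptable phrasing of the answer span.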
