DaNetQA: a yes/no Question Answering Dataset for the Russian Language

DaNetQA, a new question-answering corpus, follows the design of (Clark et al., 2019): it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer derived from that paragraph. The task is to take both the question and the paragraph as input and produce a binary yes/no answer. In this paper, we present a reproducible approach to constructing DaNetQA and investigate transfer learning methods for both task and language transfer. For task transfer we leverage three related sentence-modelling tasks: 1) a corpus of paraphrases, Paraphraser; 2) an NLI task, for which we use the Russian part of XNLI; 3) another question-answering task, SberQuAD. For language transfer we use English-to-Russian translation together with multilingual language-model fine-tuning.
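The task format described above, a (question, paragraph) pair mapped to a binary yes/no label, can be sketched as follows. This is a minimal illustration, not code from the DaNetQA release: the class and function names are hypothetical, and the majority-class baseline stands in for the transformer models actually evaluated in the paper.

```python
from dataclasses import dataclass

@dataclass
class DaNetQAExample:
    question: str  # natural yes/no question in Russian
    passage: str   # supporting paragraph from Wikipedia
    label: bool    # True for "yes", False for "no"

def encode_pair(question: str, passage: str) -> str:
    # BERT-style sentence-pair input: the question and the passage are
    # concatenated into one sequence with separator tokens, so a single
    # classifier head over the pair can emit the binary answer.
    return f"[CLS] {question} [SEP] {passage} [SEP]"

class MajorityBaseline:
    """Trivial baseline: always predict the majority class of the training set."""

    def fit(self, examples: list[DaNetQAExample]) -> "MajorityBaseline":
        yes = sum(ex.label for ex in examples)
        self.answer = yes >= len(examples) - yes
        return self

    def predict(self, question: str, passage: str) -> bool:
        # Ignores the input entirely; any real model replaces this method.
        return self.answer
```

A fine-tuned model would consume the output of `encode_pair` (tokenized) and replace `MajorityBaseline.predict`; the baseline is useful only as a floor for accuracy comparisons.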

[1] Eunsol Choi, et al. QuAC: Question Answering in Context, 2018, EMNLP.

[2] Guillaume Bouchard, et al. Interpretation of Natural Language Rules in Conversational Machine Reading, 2018, EMNLP.

[3] Leo Hickey, et al. The Pragmatics of Translation, 1998.

[4] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.

[5] Veselin Stoyanov, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.

[6] Relation Extraction Dataset for the Russian, 2020.

[7] Young-Bum Kim, et al. Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources, 2017, EMNLP.

[8] Min Zhang, et al. Cross-lingual Pre-training Based Transfer for Zero-shot Neural Machine Translation, 2019, AAAI.

[9] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[10] Guillaume Lample, et al. XNLI: Evaluating Cross-lingual Sentence Representations, 2018, EMNLP.

[11] Danqi Chen, et al. Position-aware Attention and Supervised Data Improve Slot Filling, 2017, EMNLP.

[12] Sergey I. Nikolenko, et al. Large-Scale Transfer Learning for Natural Language Generation, 2019, ACL.

[13] Matthias Hagen, et al. What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries, 2015, CIKM.

[14] Danqi Chen, et al. CoQA: A Conversational Question Answering Challenge, 2018, TACL.

[15] Iryna Gurevych, et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019, EMNLP.

[16] Anastasia Kotelnikova, et al. SentiRusColl: Russian Collocation Lexicon for Sentiment Analysis, 2019, Communications in Computer and Information Science.

[17] Elena Yagunova, et al. Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction, 2015, RuSSIR.

[18] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[19] Sebastian Ruder, et al. Universal Language Model Fine-tuning for Text Classification, 2018, ACL.

[20] Brigitte Grau, et al. How to Pre-Train Your Model? Comparison of Different Pre-Training Models for Biomedical Question Answering, 2019, PKDD/ECML Workshops.

[21] Mikhail Arkhipov, et al. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language, 2019, ArXiv.

[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[23] Ming-Wei Chang, et al. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, NAACL.

[24] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[25] Dongyan Zhao, et al. Find a Reasonable Ending for Stories: Does Logic Relation Help the Story Cloze Test?, 2019, AAAI.