Two-Step Classification using Recasted Data for Low Resource Settings

An NLP model’s ability to reason should be independent of language. Previous work uses Natural Language Inference (NLI) to probe the reasoning ability of models, mostly focusing on high-resource languages such as English. To address the scarcity of data in low-resource languages such as Hindi, we use data recasting to create four NLI datasets from four existing Hindi text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We study the consistency of textual entailment models’ predictions and propose a consistency regulariser to remove pairwise inconsistencies in predictions. Furthermore, we propose a novel two-step classification method that uses textual entailment predictions for the classification task. We further improve classification performance by jointly training the classification and textual entailment tasks. We thus highlight the benefits of data recasting and of our approach, with supporting experimental results.
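
To make the recasting and two-step classification ideas concrete, here is a minimal sketch, assuming a simple hypothesis-template scheme for a two-label sentiment task and an already-trained entailment scorer. The templates, function names, and the `entailment_score` interface are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch only: recasting a labelled classification example into
# NLI (premise, hypothesis) pairs, and classifying by choosing the label whose
# hypothesis an entailment model scores highest. Templates and the scoring
# interface are hypothetical and may differ from the paper's setup.
from typing import Callable, Dict, List, Tuple

# Hypothetical label-to-hypothesis templates for a sentiment task.
HYPOTHESIS_TEMPLATES: Dict[str, str] = {
    "positive": "This text expresses a positive sentiment.",
    "negative": "This text expresses a negative sentiment.",
}

def recast_example(text: str, gold_label: str) -> List[Tuple[str, str, str]]:
    """Turn one classification example into NLI (premise, hypothesis, label) triples:
    the gold label's hypothesis is 'entailed', every other label's is 'not_entailed'."""
    pairs = []
    for label, hypothesis in HYPOTHESIS_TEMPLATES.items():
        nli_label = "entailed" if label == gold_label else "not_entailed"
        pairs.append((text, hypothesis, nli_label))
    return pairs

def two_step_classify(text: str,
                      entailment_score: Callable[[str, str], float]) -> str:
    """Step 1: score each label's hypothesis with a trained entailment model.
    Step 2: predict the label whose hypothesis is most strongly entailed."""
    scores = {label: entailment_score(text, hyp)
              for label, hyp in HYPOTHESIS_TEMPLATES.items()}
    return max(scores, key=scores.get)
```

The sketch only shows the inference-time two-step decision; in the joint training setting described in the abstract, the classification and entailment objectives would additionally be optimised together.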
