Cognition-aware Cognate Detection

Automatic detection of cognates helps downstream NLP tasks such as Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches to cognate detection use feature sets based on orthographic, phonetic and semantic similarity. In this paper, we propose a novel method for enriching these feature sets with cognitive features extracted from human readers' gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that the extracted cognitive features help the task of cognate detection. However, collecting and annotating gaze data is costly. We therefore use the collected gaze behaviour data to predict cognitive features for a larger sample and show that the predicted cognitive features also significantly improve task performance. We report improvements of 10% with the collected gaze features, and 12% with the predicted gaze features, over previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.
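
To make the idea of enriching similarity-based feature sets concrete, the following is a minimal sketch, not the method used in the paper. It computes two common orthographic similarity scores for a word pair (normalized edit similarity and character-bigram Dice overlap) and appends any gaze-derived values to them. The function names, the choice of these two scores, and the `gaze_features` argument are illustrative assumptions; the abstract does not specify the actual feature definitions.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a: str, b: str) -> float:
    """1 minus edit distance divided by the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def feature_vector(a: str, b: str,
                   gaze_features: Sequence[float] = ()) -> List[float]:
    """Orthographic scores followed by (hypothetical) gaze-derived values."""
    return [normalized_edit_similarity(a, b), bigram_dice(a, b),
            *gaze_features]
```

A vector such as `feature_vector("night", "nacht", [0.4])` could then be passed to any standard classifier; the value `0.4` here is a placeholder for a gaze measurement, not real data.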
