Cognate Identification to Improve Phylogenetic Trees for Indian Languages

Cognates are words of common origin that appear in multiple variants of the same text across different languages. Computational phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees, which are hypothesized representations of how the languages are related, based on the output of the computational algorithm used. In our work, we detect cognates among a few Indian languages, namely Hindi, Marathi, Punjabi, and Sanskrit, to help build cognate sets for phylogenetic inference. Cognate detection aids phylogenetic inference by helping isolate diachronic sound changes and thus detect words of a common origin. Phylogenetic trees are generally inferred automatically from a cognate set that has been manually annotated with the help of a lexicographer. Our work creates cognate sets for each language pair and infers phylogenetic trees within a Bayesian framework using the maximum likelihood method. We also integrate our work into an online interface and infer phylogenetic trees from automatically detected cognate sets. The online interface creates phylogenetic trees from textual data provided as input. It lets a lexicographer enter data manually, edit the data based on their expert opinion, and eventually create phylogenetic trees using various algorithms, including our work on automatically creating cognate sets. We go on to discuss the nuances of detecting cognates in these Indian languages and the categorization of cognate words, i.e., "Tatasama" and "Tadbhava" words.
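As a rough illustration of what building a cognate set for one language pair can involve, the sketch below scores word pairs by normalized edit-distance similarity and keeps those above a threshold. This is a generic string-similarity baseline and not the method used in this work. The function names, the 0.5 threshold, and the romanized word forms are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer word's length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def cognate_candidates(pairs: List[Tuple[str, str]],
                       threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Keep word pairs whose similarity reaches the (assumed) threshold."""
    return [(a, b) for a, b in pairs if similarity(a, b) >= threshold]


# Illustrative romanized Hindi/Marathi pairs (hypothetical input, not the paper's data).
word_pairs = [("haath", "haat"), ("kitab", "pustak")]
print(cognate_candidates(word_pairs))
```

A surface-similarity threshold like this cannot separate true cognates from borrowings or chance resemblances, which is one reason expert review by a lexicographer remains useful.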
