CogNet: A Large-Scale Cognate Database

This paper introduces CogNet, a new, large-scale lexical database that provides cognates (words of common origin and meaning) across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available wordnets, with an accuracy evaluated at 94%. Finally, it presents statistics about the cognate data and some initial insights into it, hinting at possible future uses of the resource in various fields of linguistics.
