Combining Information-Weighted Sequence Alignment and Sound Correspondence Models for Improved Cognate Detection

Methods for automated cognate detection in historical linguistics invariably build on some measure of form similarity, designed to capture the systematic similarities that remain between cognate word forms after thousands of years of divergence. A wide range of clustering and classification algorithms has been explored for this purpose, but improvements to the pairwise form similarity measures themselves have not been the main focus of research. The approach presented in this paper improves this core component of cognate detection systems through a novel combination of two techniques: information weighting, which puts less weight on recurring morphological material, and sound correspondence modeling by means of pointwise mutual information. In evaluations on expert cognacy judgments over a subset of the IPA-encoded NorthEuraLex database, the combination of both techniques is shown to lead to considerable improvements in average precision for binary cognate detection, and to modest improvements for distance-based cognate clustering.
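As a rough illustration of the sound correspondence component, the following minimal sketch computes pointwise mutual information for aligned segment pairs. It is not the paper's implementation: the function name `pmi_scores`, the toy alignment data, and the use of the product of marginal frequencies as the chance model are all assumptions made for the example.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Return log2 PMI for each observed (segment_a, segment_b) pair.

    Assumption: chance co-occurrence is modeled as the product of the
    marginal frequencies in the same sample, without smoothing.
    """
    pair_counts = Counter(aligned_pairs)
    left_counts = Counter(a for a, _ in aligned_pairs)
    right_counts = Counter(b for _, b in aligned_pairs)
    total = len(aligned_pairs)

    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total                 # observed joint probability
        p_a = left_counts[a] / total     # marginal in language A
        p_b = right_counts[b] / total    # marginal in language B
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical aligned segments from word pairs in two languages.
example_pairs = (
    [("p", "f")] * 3    # recurrent correspondence p : f
    + [("t", "t")] * 3  # recurrent identity t : t
    + [("p", "p")]      # occasional match
    + [("t", "f")]      # occasional mismatch
)
scores = pmi_scores(example_pairs)
```

In this toy sample the recurrent pair `("p", "f")` gets a positive score and the sporadic pair `("t", "f")` a negative one, which is the property that lets such scores reward regular sound correspondences over chance resemblances.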
