Levenshtein Distances Fail to Identify Language Relationships Accurately

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate between homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.
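
To make the metric concrete, the following is a minimal sketch of the standard dynamic-programming computation of the Levenshtein distance, in which insertions, deletions, and substitutions each cost 1. The function names and the length normalization (dividing by the longer word's length) are assumptions made for illustration; the abstract does not say how, or whether, the article normalizes distances. The word pair is likewise only an illustrative example of two similar forms.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer word's length.
    This normalization is an assumption, not taken from the article."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Illustrative word pair differing in one segment.
print(levenshtein("lima", "rima"))             # 1
print(normalized_levenshtein("lima", "rima"))  # 0.25
```

Because the computation compares characters only, it treats a regular sound correspondence and an accidental resemblance in the same way, which is consistent with the sensitivity to orthography reported above.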
