Automated classification of the world's languages: a description of the method and preliminary results

Abstract An approach to classifying languages through automated lexical comparison is described. The method produces near-expert classifications. At its core is the Automated Similarity Judgment Program (ASJP). ASJP is applied to 100-item lists of core vocabulary from 245 globally distributed languages. The output is 29,890 lexical similarity percentages, one for each pair of languages. These percentages serve as input to a program originally designed for generating phylogenetic trees in biology, which yields branching structures (ASJP trees) reflecting the lexical similarity of languages. ASJP trees for the sample languages spoken in Middle America and South America show that the method can group languages of non-controversial genetic groups together on distinct branches. In addition, ASJP sub-branching for each of nine genetic groups (Mayan, Mixe-Zoque, Otomanguean, Huitotoan-Ocaina, Tacanan, Chocoan, Muskogean, Indo-European, and Austro-Asiatic) agrees substantially with the subgrouping proposed for those groups by expert historical linguists. Among many other uses, ASJP can be applied to search for possible relationships among languages that have not previously been observed or have been only provisionally recognized. Preliminary ASJP analysis reveals several such possible relationships for languages of Middle America and South America. Expanding the ASJP database to all of the world's languages for which 100-word lists can be assembled is a realistic goal that could be achieved in a relatively short time, perhaps within a year.
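The arithmetic behind the pair count, and the general shape of a pairwise similarity table, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give ASJP's actual similarity-judgment rules, so exact string match stands in for them here, and the function names (`similarity_percent`, `pairwise_similarities`, `pair_count`) and the three toy word lists are hypothetical.

```python
from itertools import combinations


def similarity_percent(list_a, list_b):
    """Percentage of shared meaning slots whose words are judged similar.

    ASSUMPTION: exact string equality is a placeholder for the similarity
    judgment; the real ASJP rules are not described in the abstract.
    """
    shared = [meaning for meaning in list_a if meaning in list_b]
    if not shared:
        return 0.0
    matches = sum(1 for meaning in shared if list_a[meaning] == list_b[meaning])
    return 100.0 * matches / len(shared)


def pairwise_similarities(wordlists):
    """Return {(lang1, lang2): percent} for every unordered language pair."""
    return {
        (a, b): similarity_percent(wordlists[a], wordlists[b])
        for a, b in combinations(sorted(wordlists), 2)
    }


def pair_count(n):
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical toy data: three invented languages, three meaning slots each.
toy = {
    "lang_A": {"I": "na", "you": "mi", "water": "aka"},
    "lang_B": {"I": "na", "you": "ti", "water": "aka"},
    "lang_C": {"I": "ko", "you": "ti", "water": "uru"},
}
sims = pairwise_similarities(toy)

print(pair_count(245))  # 245 * 244 / 2 = 29890, the figure in the abstract
for pair, pct in sims.items():
    print(pair, round(pct, 1))
```

A table of this kind (or the corresponding distances, 100 minus each percentage) is the sort of input a distance-based tree-building program expects; the specific program and settings used by the authors are not given in the abstract.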
