Explorations in automated language classification

An earlier paper, to which some of the present authors contributed (Brown et al. 2008), describes a method for automating language classification based on the 100-item referent list of Swadesh (1955). Here we discuss a refinement of that method: the relative stabilities of the list items are calculated, and the list is shortened by eliminating the least stable items. The result is a 40-item referent list. We explain the method for determining stabilities, as well as a method for comparing the classificatory performance of reduced lists of different sizes with that of the full 100-item list. We also present a statistical investigation of the relationship between the lexical similarity of languages and their geographical proximity. Finally, we test whether information on typological features of languages can be combined with lexical data to enhance classificatory accuracy.

[1]  Cecil H. Brown, et al. Automated classification of the world's languages: a description of the method and preliminary results, 2008.

[2]  Sheila Embleton, et al. Statistics in historical linguistics, 1986.

[3]  S. Levinson, et al. Structural Phylogenetics and the Reconstruction of Ancient Language History, 2005, Science.

[4]  Feng Wang, et al. Basic Words and Language Evolution, 2004.

[5]  William S-Y. Wang, et al. Spatial distance and lexical replacement, 1986.

[6]  Michael Cysouw, et al. Analyzing feature consistency using dissimilarity matrices, 2008.

[7]  David Gil, et al. The World Atlas of Language Structures, 2005.

[8]  Hans Goebl, et al. Dialektometrische Studien: anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3 Bände, 1984.

[9]  Daniel H. Huson, et al. SplitsTree - a program for analyzing and visualizing evolutionary data, 1997.

[10]  M. Swadesh. Towards Greater Accuracy in Lexicostatistic Dating, 1955, International Journal of American Linguistics.

[11]  Dietrich Stauffer, et al. On the relation between structural diversity and geographical distance among languages: Observations and computer simulations, 2006, physics/0607031.

[12]  N. Saitou, et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees, 1987, Molecular Biology and Evolution.

[13]  Brett Kessler, et al. Book Reviews: The Significance of Word Lists, 2001, CL.

[14]  J. Dufrénoy. La relation entre la distance spatiale et la distance lexicale, 1972.

[15]  Ahmed Albatineh, et al. On Similarity Indices and Correction for Chance Agreement, 2006, J. Classif.

[16]  Paul Black, et al. Some results from the vocabulary method of reconstructing language trees, 1973.

[17]  Søren Wichmann, et al. How to use typological databases in historical linguistic research, 2007.

[18]  M. Swadesh. Salish Internal Relationships, 1950, International Journal of American Linguistics.

[19]  April M. S. McMahon, et al. Language classification by numbers, 2005.

[20]  Daniel H. Huson, et al. SplitsTree: analyzing and visualizing evolutionary data, 1998, Bioinform.

[21]  A. L. Kroeber. Yokuts dialect survey, 1963.