Can Corpus Based Measures be Used for Comparative Study of Languages?

Quantitative measurement of inter-language distance is a useful technique for studying diachronic and synchronic relations between languages. Such measures have been used successfully for purposes such as deriving language taxonomies and language reconstruction, but they have mostly been applied to handcrafted word lists. Can corpus-based measures be used instead for the comparative study of languages? In this paper we try to answer this question. We apply three corpus-based measures, present the results obtained from them, and show how these results relate to linguistic and historical knowledge. We argue that the answer is yes and that such studies can provide or validate linguistic and computational insights.
