Orthographic measures of language distances between the official South African languages

We describe two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa. The first is based on n-grams: the confusions that a text-based language identification system makes between languages carry information about how those languages are related. Our classifier calculates n-gram statistics from text documents and uses these statistics as classification features. We show that the classification results of a validation test can serve as a similarity measure between languages, and we use these measures to represent the relationships graphically. The second method applies the Levenshtein distance to orthographic word transcriptions from the same eleven languages. Hierarchical clustering of the resulting inter-language distances shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis yield results that agree with well-known language groupings and also suggest a finer level of detail in these relationships.
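
As an illustration of the second method, the following is a minimal sketch of a Levenshtein-based distance between two languages. The abstract does not state how words are paired or whether distances are normalised, so the pairing of word lists by position, the normalisation by the longer word, and the names `levenshtein` and `language_distance` are assumptions made here for illustration only. The word lists are arbitrary placeholder strings, not data from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def language_distance(words_a, words_b):
    """Mean length-normalised edit distance over word pairs aligned by position.

    Assumption: words_a[i] and words_b[i] are orthographic forms of the same
    concept in two languages. This pairing scheme is hypothetical.
    """
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = 0.0
    for wa, wb in zip(words_a, words_b):
        longest = max(len(wa), len(wb))
        total += levenshtein(wa, wb) / longest if longest else 0.0
    return total / len(words_a)


# Placeholder word lists (not real data from any of the eleven languages).
toy_lists = {
    "A": ["kitten", "flaw"],
    "B": ["sitting", "lawn"],
    "C": ["kitten", "flaws"],
}

# Symmetric matrix of pairwise distances, keyed by language label.
labels = sorted(toy_lists)
distances = {
    (x, y): language_distance(toy_lists[x], toy_lists[y])
    for x in labels for y in labels
}
```

A matrix of this kind is the sort of input that hierarchical clustering or multidimensional scaling would take; which clustering linkage the study used is not stated in the abstract.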
