Language identification from small text samples

Abstract: There is an increasing need to deal with multilingual documents. Segmenting such documents by language would be useful both for studying linguistic phenomena such as code-switching and code-mixing, and for processing each segment computationally as appropriate. Identifying the language of a small piece of text is therefore an important problem, and it is the subject of this paper. Language identification is formulated as a generic machine learning problem: a supervised classification task in which features extracted from a training corpus are used for classification. Regression is a well-established technique for modelling and analysis, and it can also be used for classification. This paper gives a clear formulation of multiple linear regression for solving a two-class classification problem. Theoretical bases for verifying the adequacy of the model for the task and for analysing the significance of individual features are included. The method has been applied to pairwise language identification among several major Indian languages, including Hindi, Bengali, Marathi, Punjabi, Oriya, Telugu, Tamil, Malayalam and Kannada. Some of these languages belong to the Indo-Aryan family, while the others belong to the Dravidian family. Language identification has so far been a largely unexplored problem in the Indian context. Variations within and across language families have been explored, as have variations in the size of the test samples. Performance is comparable to the best published results for other languages of the world.

In most published work on language identification so far, bytes have been taken as the fundamental units of text. Indian scripts are primarily syllabic in nature, reflecting phonetic sound units more or less directly. The fundamental units of writing are called aksharas. One of the unique characteristics of Indian scripts is the concept of a script grammar. The script grammar, included in this paper, defines the set of valid aksharas. We hypothesize that aksharas, rather than characters or bytes, are the more appropriate units of text in Indian languages. Our experimental results on language identification support this claim.
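The idea of using multiple linear regression as a two-class classifier can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that each sample has already been segmented into units (aksharas in the paper's setting; the segmentation itself, which depends on the script grammar, is not shown), that the features are relative unit frequencies, that the two languages are coded as targets +1 and -1, and that a tiny ridge term is added purely for numerical stability. The names `features`, `fit`, `identify`, the vocabulary and the toy samples are all hypothetical.

```python
from collections import Counter


def features(units, vocab):
    """Intercept term followed by the relative frequency of each vocabulary unit."""
    counts = Counter(units)
    total = len(units) or 1
    return [1.0] + [counts[u] / total for u in vocab]


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit(rows, targets, ridge=1e-6):
    """Least-squares weights from the normal equations (X^T X + ridge I) w = X^T y.

    The ridge term is an assumption of this sketch: relative frequencies sum
    to one, which makes X^T X singular when an intercept is included.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def identify(weights, units, vocab, labels=("A", "B")):
    """Return the first label if the regression output is positive, else the second."""
    score = sum(w * x for w, x in zip(weights, features(units, vocab)))
    return labels[0] if score > 0 else labels[1]


# Toy, made-up data: two artificial "languages" over a two-unit vocabulary.
VOCAB = ["ka", "ta"]
TRAIN = [
    (["ka", "ka", "ka", "ta"], +1.0),  # language A
    (["ka", "ka", "ka", "ka"], +1.0),  # language A
    (["ta", "ta", "ta", "ka"], -1.0),  # language B
    (["ta", "ta", "ta", "ta"], -1.0),  # language B
]
WEIGHTS = fit([features(u, VOCAB) for u, _ in TRAIN], [t for _, t in TRAIN])

print(identify(WEIGHTS, ["ka", "ka", "ta"], VOCAB))
print(identify(WEIGHTS, ["ka", "ta", "ta", "ta"], VOCAB))
```

Thresholding the fitted value at zero is what turns the regression into a classifier; for more than two languages, one such model could be trained per language pair, in line with the pairwise setup described above.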
