Language Identification for Text Chats

This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is used to produce initial classification of the messages. Character based n-gram models of order 3 and 5 are built. A method for selectively choosing the n-grams to be modeled is used to train 15-gram models. This method produces the best-performing classifier. It has models for 57 languages and obtains over 95% accuracy on the classification of messages that are unambiguously in one language.

[1]  Timothy Baldwin,et al.  Reconsidering Language Identification for Written Language Resources , 2006, LREC.

[2]  Rafael Dueire Lins,et al.  Automatic language identification of written texts , 2004, SAC '04.

[3]  James Mayfield,et al.  Character N-Gram Tokenization for European Language Text Retrieval , 2004, Information Retrieval.

[4]  Teemu Hirsimäki,et al.  On Growing and Pruning Kneser–Ney Smoothed $ N$-Gram Models , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Jirí Navrátil,et al.  Recent advances in phonotactic language recognition using binary-decision trees , 2006, INTERSPEECH.

[6]  Steve Renals,et al.  A parallel training algorithm for hierarchical pitman-yor process language models , 2009, INTERSPEECH.

[7]  Ronald Rosenfeld,et al.  A survey of smoothing techniques for ME models , 2000, IEEE Trans. Speech Audio Process..

[8]  William John Teahan,et al.  Text classification and segmentation using minimum cross-entropy , 2000, RIAO.

[9]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[10]  W. B. Cavnar,et al.  N-gram-based text categorization , 1994 .

[11]  Andreas Stolcke,et al.  Entropy-based Pruning of Backoff Language Models , 2000, ArXiv.

[12]  Javier Macías Guarasa,et al.  Language identification based on n-gram frequency ranking , 2007, INTERSPEECH.

[13]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[14]  Kenneth R. Beesley,et al.  Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Tex , 1988 .

[15]  Tommi Vatanen,et al.  Language Identification of Short Text Segments with N-gram Models , 2010, LREC.