Paper information - Language Identification for Text Chats

This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is used to produce an initial classification of the messages. Character-based n-gram models of order 3 and 5 are built. A method for selectively choosing which n-grams to model is used to train 15-gram models, and this produces the best-performing classifier. It has models for 57 languages and obtains over 95% accuracy when classifying messages that are unambiguously in one language.
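The pipeline described in the abstract (dictionary-based initial labels, then character n-gram models trained on those labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists, the toy messages, the add-one smoothing, and the names `dictionary_label`, `CharNgramModel`, `train_models`, and `classify` are all assumptions made for illustration, and the selective n-gram choice used for the 15-gram models is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def dictionary_label(message, wordlists):
    """Return the language whose word list matches the most words, or None if unclear."""
    words = message.lower().split()
    hits = {lang: sum(w in vocab for w in words) for lang, vocab in wordlists.items()}
    ranked = sorted(hits.values(), reverse=True)
    # Reject messages with no dictionary hits or with a tie for first place.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(hits, key=hits.get)

def char_ngrams(text, n):
    """Character n-grams of the lowercased text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CharNgramModel:
    """Character n-gram model with add-one smoothing (a simplification for brevity)."""

    def __init__(self, texts, n, vocab_size):
        self.n = n
        self.vocab_size = vocab_size
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for text in texts:
            for gram in char_ngrams(text, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, text):
        """Smoothed log-probability of the text under this model."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

def train_models(labeled, n=3):
    """Train one model per language; all models share one alphabet size."""
    alphabet = {" "}
    for texts in labeled.values():
        for text in texts:
            alphabet.update(text.lower())
    vocab_size = len(alphabet) + 1  # one extra slot for unseen characters
    return {lang: CharNgramModel(texts, n, vocab_size) for lang, texts in labeled.items()}

def classify(message, models):
    """Pick the language whose model gives the message the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(message))

# Hypothetical word lists and unlabeled chat messages, invented for this sketch.
wordlists = {
    "en": {"the", "is", "you", "where", "thank"},
    "fi": {"on", "kiitos", "missä", "kissa"},
}
unlabeled = [
    "the cat is on the table", "where is the station", "thank you very much",
    "kissa on pöydällä", "missä asema on", "kiitos paljon", "ok",
]

# Step 1: dictionary-based initial labels; unclear messages are dropped.
labeled = defaultdict(list)
for msg in unlabeled:
    lang = dictionary_label(msg, wordlists)
    if lang is not None:
        labeled[lang].append(msg)

# Step 2: character trigram models trained on the automatically labeled messages.
models = train_models(labeled, n=3)
print(classify("where is the cat", models))
print(classify("missä kissa on", models))
```

Add-one smoothing is used here only because it fits in a few lines; with real data, a better smoothing scheme and far larger training sets would be needed before the accuracy figures in the abstract could be approached.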

Vesa Siivola | Bryan L. Pellom | Meagan Sills
