Statistical Language Identification of Short Texts

Correctly identifying the language of short texts would be useful in many applications, yet few satisfactory attempts are reported in the literature. In this paper we describe a Naive Bayes classifier that performs well on very short texts, together with the corpus we built from movie subtitles to train it. Both the corpus and the algorithm are available under the GNU Lesser General Public License.
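
To make the general idea concrete, the following is a minimal sketch of a Naive Bayes language identifier. It is not the implementation described in the paper: the choice of character trigrams as features, add-one (Laplace) smoothing, the class name NaiveBayesLanguageId, the helper char_ngrams, and the tiny hand-written training set are all assumptions made for illustration. A real system would be trained on a large corpus such as the subtitle corpus mentioned above.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageId:
    """Hypothetical multinomial Naive Bayes over character n-grams."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                          # n-gram length (assumed: trigrams)
        self.alpha = alpha                  # additive smoothing constant (assumed)
        self.counts = defaultdict(Counter)  # language -> n-gram frequencies
        self.doc_counts = Counter()         # language -> number of training texts
        self.vocab = set()                  # all n-grams seen during training

    def train(self, samples):
        """samples: iterable of (text, language_label) pairs."""
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.doc_counts[lang] += 1
            self.vocab.update(grams)

    def predict(self, text):
        """Return the language with the highest posterior log-probability."""
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # log prior + sum of smoothed log likelihoods, one term per n-gram
            score = math.log(self.doc_counts[lang] / total_docs)
            for gram in grams:
                score += math.log(
                    (counter[gram] + self.alpha)
                    / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, invented for this example only.
samples = [
    ("where are you going tonight", "en"),
    ("i do not know what you mean", "en"),
    ("wohin gehst du heute abend", "de"),
    ("ich weiß nicht was du meinst", "de"),
    ("où vas-tu ce soir", "fr"),
    ("je ne sais pas ce que tu veux dire", "fr"),
]

classifier = NaiveBayesLanguageId()
classifier.train(samples)
print(classifier.predict("what do you know"))
```

Scores are summed in log space so that the product of many small probabilities does not underflow, and the smoothing term keeps an unseen trigram from driving a language's score to negative infinity, which matters most when the input is only a few words long.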
