The textcat Package for n-Gram Based Text Categorization in R

Identifying the language of a text is typically the first step in natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, those employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach and a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.
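
To make the idea of categorization by character n-gram frequencies concrete, the following is a minimal Python sketch of a rank-based profile comparison in the spirit of Cavnar and Trenkle (1994). It is not the textcat implementation: the function names (`ngram_profile`, `out_of_place`, `classify`), the whitespace tokenization, the underscore padding, the profile size of 400, the n-gram lengths 1 to 5, and the toy training sentences are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=5, size=400):
    """Rank the most frequent character n-grams (n = 1..n_max) of a text.

    Assumption: tokens are split on whitespace and padded with "_",
    a simplification of the tokenization in the original approach.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency; ties broken alphabetically so results are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams missing from the category profile
    receive a maximum penalty equal to the category profile length."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the category whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy training data; real profiles need far more text.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then the fox sleeps",
    "german": "der schnelle braune fuchs springt über den faulen hund und schläft dann",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

print(classify("the dog and the fox", profiles))
```

With training texts this small the result is only indicative. The point of the sketch is the mechanism: each category is reduced to a ranked list of frequent n-grams, and a document is assigned to the category whose ranking it disturbs least.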

[1] Sung-Hyuk Cha, et al. Language Identification from Text Using N-gram Based Cumulative Frequency Addition, 2004.

[2] Kevin P. Scannell. The Crúbadán Project: Corpus building for under-resourced languages, 2007.

[3] Anil Kumar Singh. Study of Some Distance Measures for Language and Encoding Identification, 2006.

[4] Yuen Ren Chao, et al. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, 1950.

[5] N. Mikelic, et al. Language Indentification: How to Distinguish Similar Languages?, 2007, 2007 29th International Conference on Information Technology Interfaces.

[6] Elisabeth Dévière, et al. Analyzing linguistic data: a practical introduction to statistics using R, 2009.

[7] Ted E. Dunning, et al. Statistical Identification of Language, 1994.

[8] R Core Team, et al. R: A language and environment for statistical computing, 2014.

[9] Peter Henrich. Language identification for the automatic grapheme-to-phoneme conversion of foreign words in a German text-to-speech system, 1989, EUROSPEECH.

[10] Kurt Hornik, et al. Text Mining Infrastructure in R, 2008.

[11] Laila Khreisat, et al. A machine learning approach for Arabic text classification using N-gram frequency statistics, 2009, J. Informetrics.

[12] David McKelvie, et al. Data in Your Language: the ECI Multilingual Corpus 1, 2007.

[13] W. B. Cavnar, et al. N-gram-based text categorization, 1994.

[14] Leo Egghe, et al. The Distribution of N-Grams, 2000, Scientometrics.

[15] Kenneth R. Beesley, et al. Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text, 1988.

[16] Probabilistic Language Modelling, 2002.

[17] Philip Hanna, et al. Extending Zipf’s law to n-grams for large corpora, 2009, Artificial Intelligence Review.