An N-Gram-and-Wikipedia joint approach to Natural Language Identification

Natural Language Identification is the process of detecting and determining the language, or languages, in which a given piece of text is written. As a key step in Computational Linguistics/Natural Language Processing (NLP) tasks such as Machine Translation, Multi-lingual Information Retrieval, and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well-studied sub-fields of NLP. However, several problems in this field remain far from resolved. Current non-computational approaches require researchers to possess sufficient prior linguistic knowledge of the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each language to be identified. The drawbacks of both are apparent: few computer scientists are equipped with sufficient linguistic knowledge, and training sets may grow ever larger in pursuit of higher accuracy and coverage of more languages. Moreover, neither approach yields satisfactory results on the multi-lingual documents found on the Internet. To address these problems, this paper proposes a new approach to Natural Language Identification: it exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments demonstrate that this approach yields satisfactory results, especially on multi-lingual documents.
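To illustrate the two-stage idea described above, the following Python sketch segments text where character N-Gram statistics shift and then labels each segment by probing Wikipedia. It is a minimal illustration under stated assumptions, not the authors' exact method: the candidate-language list, the segmentation threshold, and the word-lookup scoring heuristic are assumptions, and the standard MediaWiki query API is used as a stand-in for whatever Wikipedia lookup the paper employs.

```python
# Minimal sketch of an N-Gram + Wikipedia language-identification pipeline.
# NOTE: threshold, candidate languages, and the scoring heuristic are assumptions,
# not the paper's actual parameters.

from collections import Counter
import requests

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a piece of text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def profile_distance(p, q):
    """Dissimilarity between two n-gram profiles (1 minus cosine similarity)."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (sum(v * v for v in p.values()) ** 0.5) * (sum(v * v for v in q.values()) ** 0.5)
    return 1.0 - (dot / norm if norm else 0.0)

def segment_by_ngrams(sentences, threshold=0.6):
    """Group consecutive sentences; start a new segment when the profile shifts sharply."""
    segments, current = [], [sentences[0]]
    for sent in sentences[1:]:
        if profile_distance(char_ngrams(" ".join(current)), char_ngrams(sent)) > threshold:
            segments.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    segments.append(" ".join(current))
    return segments

def wikipedia_has_title(word, lang):
    """True if `word` exists as an article title in the given Wikipedia edition."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "titles": word, "format": "json"},
        timeout=10,
    ).json()
    return not any("missing" in page for page in resp["query"]["pages"].values())

def identify_language(segment, candidates=("en", "de", "fr")):
    """Score each candidate language by how many of the segment's words Wikipedia knows."""
    words = [w.strip(".,;:!?") for w in segment.split()][:10]  # limit API calls
    scores = {lang: sum(wikipedia_has_title(w, lang) for w in words) for lang in candidates}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    text = ["This is an English sentence.", "Dies ist ein deutscher Satz."]
    for seg in segment_by_ngrams(text):
        print(identify_language(seg), "->", seg)
```

In this sketch a segment boundary is declared whenever a sentence's character-trigram profile diverges sharply from the running profile of the current segment, which roughly mirrors the paper's language-specific segmentation step; any comparable distance measure or threshold could be substituted.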
