Paper information - Disentangling from Babylonian Confusion - Unsupervised Language Identification

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus into the languages the corpus contains, and it makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual and bilingual corpora shows that classification quality is comparable to that of supervised approaches and that the method is almost error-free from 100 sentences per language onward.
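To make the task concrete, the following is a minimal sketch of unsupervised sentence sorting. It is not the authors' method, which the abstract does not describe. It swaps in a deliberately simple stand-in: words that occur in the same sentence are linked, and the connected components of the resulting word graph (found with union-find) are treated as languages. The function name `cluster_sentences`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration. Like the method summarized above, the sketch needs no labels and does not fix the number of languages in advance.

```python
from collections import defaultdict


def cluster_sentences(sentences):
    """Group sentences linked by shared words (hypothetical helper).

    Assumption: lowercased whitespace tokens are an adequate notion of "word".
    """
    parent = {}

    def find(word):
        # Return the representative of the word's component (path halving).
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    tokenized = [s.lower().split() for s in sentences]

    # Link every word of a sentence to the sentence's first word.
    for words in tokenized:
        for word in words[1:]:
            union(words[0], word)

    # Sentences whose words share a component end up in the same cluster.
    clusters = defaultdict(list)
    for sentence, words in zip(sentences, tokenized):
        if words:
            clusters[find(words[0])].append(sentence)
    return list(clusters.values())


corpus = [
    "the cat sat on the mat",
    "der hund schläft auf dem teppich",
    "le chat dort sur le tapis",
    "the dog is on the mat",
    "die katze schläft auf dem teppich",
    "le chien dort sur le tapis",
]

for group in cluster_sentences(corpus):
    print(group)
```

On this toy corpus the sentences fall into three groups because the three vocabularies do not overlap. A single token shared across languages, such as a name or a number, would merge two groups, so a practical system would need a more robust clustering criterion than plain connectivity.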

Christian Biemann | Sven Teresniak

[1] U. Quasthoff, et al. The Poisson Collocation Measure and its Applications, 2002.

[2] Ted E. Dunning, et al. Statistical Identification of Language, 1994.

[3] W. B. Cavnar, et al. N-gram-based text categorization, 1994.

[4] Ramon Ferrer i Cancho, et al. The small world of human language, 2001, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[5] Christian Biemann, et al. Automatically Building Concept Structures and Displaying Concept Trails for the Use in Brainstorming Sessions and Content Management Systems, 2004, IICS.

[6] Georg Rehm, et al. Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage, 2002, Proceedings of the 35th Annual Hawaii International Conference on System Sciences.

[7] A. Barabasi, et al. Scale-free characteristics of random networks: the topology of the world-wide web, 2000.

[8] Eduard Hovy, et al. Towards terascale knowledge acquisition, 2004, COLING 2004.

[9] Georg Rehm. Towards Automatic Web Genre Identification, 2002, HICSS.

[10] G. Zipf, et al. Relative Frequency as a Determinant of Phonetic Change, 1930.

[11] Christian Biemann, et al. Language-Independent Methods for Compiling Monolingual Lexical Data, 2004, CICLing.

[12] Patrick Pantel, et al. Towards Terascale Semantic Acquisition, 2004, COLING.