Language identification in web pages

This paper discusses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, which often present harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Both fast and robust, the software has been in use for the past two years as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating between languages on Web pages.
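
The abstract does not spell out the algorithm, so the following is only a minimal sketch. It assumes that the "well-known n-gram based algorithm" resembles the rank-order profile method of Cavnar and Trenkle, in which each language is represented by its most frequent character n-grams and a document is assigned to the language whose profile is closest under an "out-of-place" rank distance. The function names, the `top_k` and `n_max` parameters, and the toy training sentences are all hypothetical. The paper's own heuristics and its new similarity measure are not reproduced here.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..n_max) to their rank."""
    # Normalise case and whitespace, then pad so word boundaries form n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile has the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training data (hypothetical); real profiles need far larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog while the wind blows",
    "fr": "le renard brun saute par dessus le chien paresseux pendant que le vent souffle",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

print(guess_language(SAMPLES["fr"], PROFILES))
```

With profiles this small the sketch is only reliable on text close to the training samples. Web pages add further difficulties, such as markup, navigation fragments, and mixed-language content, which is presumably where the heuristics mentioned in the abstract come in.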
