Identifying Language Origin of Named Entity With Multiple Information Sources

To identify the language origin of a named entity, morphological information associated with its spelling, such as letter N-grams, is commonly employed. With this information alone, however, named entities that have similar spellings but come from different language origins are difficult to differentiate. In this paper, a measure of "popularity," defined as the frequency or page count of the named entity in language-specific Web searches, is proposed for identifying its language origin. Morphological information, including letter or letter-chunk N-grams, is used in conjunction with the Web-based page counts to improve identification performance. Six languages are tested: English, German, French, Portuguese, Chinese, and Japanese, with Chinese and Japanese named entities represented in their phonetic alphabets (Pinyin and Romaji, respectively). Experiments show that when classifying the four languages written in the Latin alphabet (English, German, French, and Portuguese), combining features from the different information sources yields a substantial improvement in classification accuracy over a letter 4-gram-based baseline system: accuracy increases from 75.0% to 86.3%, a 45.2% relative error reduction (the error rate falls from 25.0% to 13.7%).
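The idea of combining spelling evidence with Web popularity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the add-one-smoothed bag of letter 4-grams, the linear interpolation controlled by `weight`, the toy name lists, and the hard-coded page counts (no real search engine is queried). The abstract does not state how the features are actually combined, so this should be read as one plausible arrangement rather than the authors' system.

```python
import math
from collections import Counter


def letter_ngrams(name, n=4):
    """Return the letter n-grams of a name, padded with boundary markers."""
    padded = "^" * (n - 1) + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Add-one-smoothed relative frequencies of letter n-grams (a simplification)."""

    def __init__(self, names, n=4):
        self.n = n
        self.counts = Counter(g for name in names for g in letter_ngrams(name, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def avg_log_prob(self, name):
        """Average smoothed log-probability of the name's n-grams."""
        grams = letter_ngrams(name, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def page_count_log_share(page_counts):
    """Log of each language's smoothed share of the total page count."""
    total = sum(c + 1 for c in page_counts.values())
    return {lang: math.log((c + 1) / total) for lang, c in page_counts.items()}


def identify_origin(name, models, page_counts, weight=0.5):
    """Interpolate spelling and popularity scores (assumed combination rule)."""
    popularity = page_count_log_share(page_counts)
    scores = {
        lang: (1 - weight) * models[lang].avg_log_prob(name) + weight * popularity[lang]
        for lang in models
    }
    return max(scores, key=scores.get), scores


# Toy training names and hypothetical page counts, for illustration only.
models = {
    "English": NgramModel(["smith", "johnson", "williams"]),
    "German": NgramModel(["schmidt", "müller", "schneider"]),
}
hypothetical_page_counts = {"English": 1200, "German": 300}

if __name__ == "__main__":
    best, scores = identify_origin("schmidt", models, hypothetical_page_counts)
    print(best, scores)
```

Setting `weight=0.0` reduces the sketch to a spelling-only classifier, and `weight=1.0` to a popularity-only one; intermediate values let page counts break ties between languages whose spelling models score a name similarly, which is the situation the abstract describes.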
