Paper Information - Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora

We present a novel method for discovering and modeling the relationship between informal Chinese expressions (including colloquialisms and instant-messaging slang) and their formal equivalents. Specifically, we propose a bootstrapping procedure to identify a list of candidate informal phrases in web corpora. Given an informal phrase, we retrieve contextual instances from the web using a search engine, generate hypotheses of formal equivalents from these data, and rank the hypotheses using a conditional log-linear model. In the log-linear model, we incorporate as feature functions both rule-based intuitions and data co-occurrence phenomena (either an explicit or indirect definition, or formal and informal usages occurring in free variation within a discourse). We test our system on manually collected test examples and find that formal-informal relationship discovery and extraction using our method achieves an average 1-best precision of 62%. Given the ubiquity of informal conversational style on the internet, this work has clear applications for text normalization in text-processing systems, including machine translation, that aspire to broad coverage.
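The ranking step described in the abstract can be illustrated with a minimal sketch of a conditional log-linear model. This is not the authors' implementation: the feature names (`definition_pattern`, `cooccurrence`, `pinyin_similarity`), the weights, and the placeholder candidates (`formal_a`, `formal_b`, `formal_c`) are all assumptions chosen for illustration, and in practice the weights would be estimated from training data rather than set by hand.

```python
import math

# Hypothetical feature weights; a real system would learn these from data.
weights = {
    "definition_pattern": 2.0,  # candidate appears in a definition-like context
    "cooccurrence": 1.0,        # candidate co-occurs with the informal phrase
    "pinyin_similarity": 0.5,   # rule-based phonetic similarity
}

# Hypothetical candidate formal equivalents for one informal phrase,
# each described by a sparse feature vector.
candidates = {
    "formal_a": {"definition_pattern": 1.0, "cooccurrence": 0.6},
    "formal_b": {"cooccurrence": 0.9, "pinyin_similarity": 1.0},
    "formal_c": {"pinyin_similarity": 0.2},
}


def rank_candidates(candidates, weights):
    """Rank candidates by p(formal | informal) under a log-linear model.

    The score of a candidate is the weighted sum of its features, and the
    conditional probability is the softmax of the scores over all candidates.
    """
    scores = {
        phrase: sum(weights.get(name, 0.0) * value
                    for name, value in feats.items())
        for phrase, feats in candidates.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {p: math.exp(s - max_score) for p, s in scores.items()}
    normalizer = sum(exp_scores.values())
    probs = {p: e / normalizer for p, e in exp_scores.items()}
    # Highest-probability hypothesis first.
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for phrase, prob in rank_candidates(candidates, weights):
        print(f"{phrase}\t{prob:.3f}")
```

Under this formulation, 1-best precision would be the fraction of informal test phrases for which the first-ranked hypothesis matches the reference formal phrase.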

David Yarowsky | Zhifei Li
