Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora

We present a novel method for discovering and modeling the relationship between informal Chinese expressions (including colloquialisms and instant-messaging slang) and their formal equivalents. Specifically, we propose a bootstrapping procedure to identify a list of candidate informal phrases in web corpora. Given an informal phrase, we retrieve contextual instances from the web using a search engine, generate hypotheses of formal equivalents from this data, and rank the hypotheses using a conditional log-linear model. The model's feature functions encode both rule-based intuitions and data co-occurrence phenomena (either an explicit or indirect definition, or formal and informal usages occurring in free variation within a discourse). We test our system on manually collected test examples and find that formal-informal relation discovery and extraction with our method achieves an average 1-best precision of 62%. Given the ubiquity of informal conversational style on the internet, this work has clear applications to text normalization in text-processing systems that aspire to broad coverage, including machine translation.
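The ranking step can be illustrated with a minimal sketch of a conditional log-linear model, in which each formal hypothesis receives a probability proportional to the exponential of a weighted feature sum. The feature names (`rule_match`, `definition_pattern`, `cooccurrence`), the weights, and the placeholder hypotheses below are assumptions made for illustration only; they are not the features, parameters, or data used in this work.

```python
import math

# Hypothetical feature weights; in practice these would be learned from data.
WEIGHTS = {
    "rule_match": 1.5,          # assumed: a rule-based intuition fires
    "definition_pattern": 2.0,  # assumed: pair appears in a definition-like context
    "cooccurrence": 0.5,        # assumed: normalized co-occurrence strength
}

# Placeholder hypotheses for one informal phrase, each with a feature vector.
CANDIDATES = {
    "formal_A": {"rule_match": 1.0, "definition_pattern": 1.0, "cooccurrence": 0.4},
    "formal_B": {"rule_match": 0.0, "definition_pattern": 0.0, "cooccurrence": 0.9},
    "formal_C": {"rule_match": 1.0, "definition_pattern": 0.0, "cooccurrence": 0.1},
}


def score(weights, features):
    """Linear score: dot product of the weights and the feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank(weights, candidates):
    """Return (hypothesis, probability) pairs sorted by descending probability.

    Probabilities follow p(f | i) = exp(w . phi(i, f)) / Z, where Z sums over
    all hypotheses generated for the same informal phrase.
    """
    scores = {hyp: score(weights, feats) for hyp, feats in candidates.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {hyp: math.exp(s - top) for hyp, s in scores.items()}
    z = sum(exps.values())
    ranked = [(hyp, e / z) for hyp, e in exps.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for hypothesis, prob in rank(WEIGHTS, CANDIDATES):
        print(f"{hypothesis}\t{prob:.3f}")
```

Under this formulation, 1-best precision corresponds to how often the top-ranked hypothesis is the correct formal equivalent.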
