Chinese Informal Word Normalization: an Experimental Study

We study the linguistic phenomenon of informal words in Chinese microtext and present a novel method for normalizing Chinese informal words to their formal equivalents. We formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. Our two-stage selection-classification model is evaluated on a crowdsourced corpus and achieves a normalization precision of 89.5% across the different channels, a significant improvement over the state of the art.
