Chinese Informal Word Normalization: an Experimental Study

We study the linguistic phenomenon of informal words in the domain of Chinese microtext and present a novel method for normalizing Chinese informal words to their formal equivalents. We formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. Our two-stage selection-classification model is evaluated on a crowdsourced corpus and achieves a normalization precision of 89.5% across the different channels, significantly improving on the state of the art.

Takashi Onishi | Min-Yen Kan | Kai Ishikawa | Aobo Wang | Daniel Andrade
