Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

We address the problem of informal word recognition in Chinese microblogs. A key difficulty is that Chinese lacks word delimiters, so recognizing informal words depends on word segmentation. We exploit this dependence as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. Our joint inference method significantly outperforms baseline systems that conduct the tasks individually or sequentially.
