New Word Detection and Tagging on Chinese Twitter Stream

Twitter has become one of the critical channels for disseminating up-to-date information. Because the volume of tweets can be huge, an automatic system for analyzing them is desirable. The obstacle is that Twitter users frequently invent new words using non-standard rules, and these words appear in bursts within a short period of time. Existing new word detection methods cannot identify them effectively, and even when new words are identified, their meanings are difficult to understand. In this paper, we focus on Chinese Twitter, where the absence of natural word delimiters in a sentence makes the problem more difficult. To solve it, we derive an unsupervised new word detection framework that does not rely on training data. We then introduce automatic tagging for new word annotation, which tags each new word with known words according to our proposed tagging algorithm.
