Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English-language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at: http://www.ark.cs.cmu.edu/TweetNLP

This paper describes release 0.3 of the “CMU Twitter Part-of-Speech Tagger” and annotated data. [This paper is forthcoming in Proceedings of NAACL 2013; Atlanta, GA, USA.]
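The abstract mentions using unsupervised word clusters as tagging features. As a rough illustration only, the sketch below shows one common way hierarchical clusters can be turned into features: each word maps to a bit-string path in a cluster tree, and prefixes of that path serve as coarse-to-fine features. The cluster table, the function name `cluster_prefix_features`, the feature naming, and the prefix lengths are all assumptions made for this example; they are not taken from the released tagger.

```python
from typing import Dict, List, Sequence

# Hypothetical toy mapping from word to a hierarchical cluster bit string.
# Real paths would come from a clustering run over unlabeled text.
TOY_CLUSTERS: Dict[str, str] = {
    "lol": "11100101",
    "lmao": "11100100",
    "tomorrow": "0110",
}


def cluster_prefix_features(
    word: str,
    clusters: Dict[str, str],
    prefix_lengths: Sequence[int] = (2, 4, 6, 8),
) -> List[str]:
    """Return one feature string per usable prefix of the word's cluster path."""
    path = clusters.get(word.lower())
    if path is None:
        # Words absent from the cluster table get a single fallback feature.
        return ["cluster=NONE"]
    features = []
    for n in prefix_lengths:
        if n > len(path):
            break  # the path is shorter than this prefix length
        features.append(f"cluster_prefix{n}={path[:n]}")
    return features


if __name__ == "__main__":
    for token in ["lol", "lmao", "tomorrow", "unseenword"]:
        print(token, cluster_prefix_features(token, TOY_CLUSTERS))
```

In this toy table, "lol" and "lmao" share their first six bits, so they produce identical features at prefix lengths 2, 4, and 6 and differ only at length 8. That sharing is what lets a tagger generalize from a word seen in labeled data to a rarer variant in the same cluster.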

Brendan T. O'Connor | Noah A. Smith | Chris Dyer | Kevin Gimpel | Nathan Schneider | Olutobi Owoputi

[1] Jorge Nocedal, et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[2] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[3] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[4] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[5] Andrew McCallum, et al. Maximum Entropy Markov Models for Information Extraction and Segmentation, 2000, ICML.

[6] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[7] James Allan, et al. Using part-of-speech patterns to reduce query ambiguity, 2002, SIGIR '02.

[8] Peter D. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews, 2002, ACL.

[9] Alexander Clark, et al. Combining Distributional and Morphological Information for Part of Speech Induction, 2003, EACL.

[10] Percy Liang, et al. Semi-Supervised Learning for Natural Language, 2005.

[11] Noah A. Smith, et al. Contrastive Estimation: Training Log-Linear Models on Unlabeled Data, 2005, ACL.

[12] H. Zou, et al. Regularization and variable selection via the elastic net, 2005.

[13] Eric N. Forsyth. Improving automated lexical and discourse analysis of online chat dialog, 2007.

[14] Craig H. Martell, et al. Lexical and Discourse Analysis of Online Chat Dialog, 2007, International Conference on Semantic Computing (ICSC 2007).

[15] Jianfeng Gao, et al. Scalable training of L1-regularized log-linear models, 2007, ICML '07.

[16] Xavier Carreras, et al. Simple Semi-supervised Dependency Parsing, 2008, ACL.

[17] Eugene Charniak, et al. Automatic Domain Adaptation for Parsing, 2010, NAACL.

[18] Brendan T. O'Connor, et al. TweetMotif: Exploratory Search and Topic Summarization for Twitter, 2010, ICWSM.

[19] Yoshua Bengio, et al. Word Representations: A Simple and General Method for Semi-Supervised Learning, 2010, ACL.

[20] Estevam R. Hruschka, et al. Toward an Architecture for Never-Ending Language Learning, 2010, AAAI.

[21] Oren Etzioni, et al. Named Entity Recognition in Tweets: An Experimental Study, 2011, EMNLP.

[22] Josef van Genabith, et al. #hardtoparse: POS Tagging and Parsing the Twitterverse, 2011, Analyzing Microtext.

[23] Timothy Baldwin, et al. Lexical Normalisation of Short Text Messages: Makn Sens a #twitter, 2011, ACL.

[24] Brendan T. O'Connor, et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments, 2011, ACL.

[25] Eric P. Xing, et al. Discovering Sociolinguistic Associations with Structured Sparsity, 2011, ACL.

[26] Nicholas Diakopoulos, et al. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs, 2011, EMNLP.

[27] Ming Zhou, et al. Recognizing Named Entities in Tweets, 2011, ACL.

[28] Phil Blunsom, et al. A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction, 2011, ACL.

[29] Oren Etzioni, et al. Identifying Relations for Open Information Extraction, 2011, EMNLP.

[30] T. Schnoebelen. Do You Smile with Your Nose? Stylistic Variation in Twitter Emoticons, 2012.

[31] Chris Dyer, et al. Part-of-Speech Tagging for Twitter: Word Clusters and Other Advances, 2012.

[32] Timothy Baldwin, et al. langid.py: An Off-the-shelf Language Identification Tool, 2012, ACL.

[33] Jakob Uszkoreit, et al. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure, 2012, NAACL.

[34] Slav Petrov, et al. Overview of the 2012 Shared Task on Parsing the Web, 2012.

[35] Slav Petrov, et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[36] Jacob Eisenstein, et al. What to do about bad language on the internet, 2013, NAACL.