Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results

With the growing number of people communicating over the internet, the amount of text available online has increased steadily. Most such text departs from the standard language, because people use various short forms of words to save time and effort. We call this noisy text. Part-of-speech (POS) tagging has reached high levels of accuracy, enabling the use of automatic POS tags in various language processing tasks; however, tagging performance degrades quickly on noisy text. This paper attempts to adapt a state-of-the-art English POS tagger, trained on the Wall Street Journal (WSJ) corpus, to noisy text. We classify the noise in text into different types and evaluate the tagger with respect to each type. We attack the problem of tagging noisy text in two ways: (a) overcoming the noise in a post-processing step after tagging, and (b) cleaning the noise first and then tagging. We propose techniques for both approaches and critically compare them based on error analysis. We demonstrate the proposed models on a Short Message Service (SMS) dataset, where they achieve a significant improvement over the baseline accuracy of a state-of-the-art English POS tagger on noisy words.
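
The two strategies named in the abstract can be contrasted with a small sketch. This is a toy illustration only, not the authors' models: the `NORMALIZATION` and `LEXICON` tables, the `toy_tagger` function, and the fallback tag are all hypothetical stand-ins for a real normalizer and a WSJ-trained tagger.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables, invented for illustration only.
NORMALIZATION: Dict[str, str] = {"u": "you", "r": "are", "gr8": "great"}
LEXICON: Dict[str, str] = {"you": "PRP", "are": "VBP", "great": "JJ"}

Tagged = List[Tuple[str, str]]


def toy_tagger(tokens: List[str]) -> Tagged:
    """Stand-in tagger: dictionary lookup, unknown words default to NN."""
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]


def clean_then_tag(tokens: List[str]) -> Tagged:
    """Strategy (b): normalize noisy tokens first, then tag the clean text."""
    cleaned = [NORMALIZATION.get(tok.lower(), tok) for tok in tokens]
    tags = [tag for _, tag in toy_tagger(cleaned)]
    # Report the tags against the original (noisy) tokens.
    return list(zip(tokens, tags))


def tag_then_correct(tokens: List[str]) -> Tagged:
    """Strategy (a): tag the noisy text, then repair tags of noisy tokens."""
    corrected: Tagged = []
    for tok, tag in toy_tagger(tokens):
        key = tok.lower()
        if key not in LEXICON and key in NORMALIZATION:
            # Replace the tag with the one for the normalized form, if known.
            tag = LEXICON.get(NORMALIZATION[key], tag)
        corrected.append((tok, tag))
    return corrected


if __name__ == "__main__":
    sms = ["u", "r", "gr8"]
    print(toy_tagger(sms))
    print(clean_then_tag(sms))
    print(tag_then_correct(sms))
```

Because the stand-in tagger ignores context, both strategies produce the same tags here. With a sequence tagger the two can differ, since cleaning first also changes the context seen by neighbouring words, while post-processing only repairs tags after the fact.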