Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance, and I examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or from better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained, that is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. It may not be possible to adequately capture the status of some words by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
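
The gap between the two figures quoted above (token accuracy versus sentence accuracy) is easier to see with a small computation. The following is a minimal sketch, not the evaluation code used for the Stanford tagger: the function name `tagging_accuracy` and the toy tag sequences are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def tagging_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> Tuple[float, float]:
    """Return (token accuracy, sentence accuracy) for parallel tag sequences.

    Hypothetical helper for illustration; a sentence counts as correct
    only if every one of its tokens is tagged correctly.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    total_tokens = 0
    correct_tokens = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence pair must have the same length")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        if matches == len(gold_tags):
            correct_sentences += 1

    if total_tokens == 0:
        raise ValueError("no tokens to score")
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Toy data (invented): two three-token sentences, one tagging error.
gold: List[List[str]] = [["DT", "NN", "VBZ"], ["PRP", "VBP", "RB"]]
pred: List[List[str]] = [["DT", "NN", "VBZ"], ["PRP", "VBD", "RB"]]

token_acc, sentence_acc = tagging_accuracy(gold, pred)
print(f"token accuracy:    {token_acc:.3f}")
print(f"sentence accuracy: {sentence_acc:.3f}")
```

A single wrong tag makes the whole sentence count as wrong, so sentence accuracy falls much faster than token accuracy. As a rough back-of-the-envelope check, and under the simplifying assumption (not made in the text above) that errors are independent across tokens, a sentence of n tokens is fully correct with probability 0.973^n, which is about 0.56 for n = 21.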
