The effects of part-of-speech tagsets on tagger performance

In natural language processing (NLP), a crucial subsystem in a wide range of applications is a part-of-speech (POS) tagger, which labels (or classifies) unannotated words of natural language with part-of-speech labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based : a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabelled data. Previous work has tended to focus on applying new algorithms to the problem or adding hand-tuned features to assist in classifying difficult instances. Using these methods, a number of distinct approaches have plateaued to similar accuracy figures of 96.9± 0.3%. Here we approach the problem of improving accuracy in POS tagging from a unique angle. We use a representative set of tagging algorithms and attempt to optimise performance by modifying the inventory of tags (or tagset) used in the pre-labelled training data . We modify tagsets by systematically mapping the tags of the training data to a new tagset. Our aim is to produce a tagset which is more conducive to automatic POS tagging by more accurately reflecting the underlying lingustic distinctions which should be encoded in a tagset. The mappings are reversible, enabling the original tags to be trivially recovered, which facilitates comparison with previous work and between competing mappings. We explore two different broad sources of these mappings. Our primary focus is on using linguistic insight to determine potentially useful distinctions which we can then evaluate empirically. We also evaluate an alternative data-driven approach for extracting patterns of regularity in a tagged corpus. Our experiments indicate the approach is not as successful as we had predicted. Our most successful mappings were data-driven, which give improvements of approximately 0.01% in token level accuracy over the development set using specific taggers, with increments of 0.03% over the test set. We show a wide range of linguistically motivated modifications which cause a performance decrement, while the best linguistic approaches maintain performance approximately over the development data and produce up to 0.05% improvement over the development data. Our results lead us to believe that this line of research is unlikely to provide significant gains over conventional approaches to POS tagging.

[1]  Yuji Matsumoto,et al.  Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines , 2001, NLPRS.

[2]  Lluís Màrquez i Villodre,et al.  SVMTool: A general POS Tagger Generator Based on Support Vector Machines , 2004, LREC.

[3]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition , 2000, Prentice Hall series in artificial intelligence.

[4]  Beatrice Santorini Part-of-speech tagging guidelines for the penn treebank project , 1990 .

[5]  Eugene Charniak,et al.  Equations for Part-of-Speech Tagging , 1993, AAAI.

[6]  Grace Ngai,et al.  Transformation Based Learning in the Fast Lane , 2001, NAACL.

[7]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[8]  Thorsten Brants,et al.  TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[9]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[10]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[11]  Eric Brill,et al.  Classifier Combination for Improved Lexical Disambiguation , 1998, ACL.

[12]  David Crystal,et al.  The Cambridge Encyclopedia of Language , 2012, Modern Language Review.

[13]  Jes Us Gim Enez And Llu Fast and Accurate Part{of{speech Tagging: the Svm Approach Revisited , 2003 .

[14]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[15]  Andrei Mikheev,et al.  Tagging Sentence Boundaries , 2000, ANLP.

[16]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[17]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[18]  Johansson. Stig,et al.  Manual of information to accompany the Lancaster-Oslo : Bergen Corpus of British English, for use with digital computers , 1978 .

[19]  Jan Svartvik,et al.  A __ comprehensive grammar of the English language , 1988 .

[20]  Rob Malouf,et al.  A Comparison of Algorithms for Maximum Entropy Parameter Estimation , 2002, CoNLL.

[21]  Steven J. DeRose,et al.  Grammatical Category Disambiguation by Statistical Optimization , 1988, CL.