Learning a Part-of-Speech Tagger from Two Hours of Annotation

Most work on weakly-supervised learning for part-of-speech taggers has been based on unrealistic assumptions about the amount and quality of training data. In this paper, we attempt to create true low-resource scenarios by allowing a linguist just two hours to annotate data, and we evaluate on two languages, Kinyarwanda and Malagasy. Given these severely limited amounts of either type supervision (tag dictionaries) or token supervision (labeled sentences), we are able to dramatically improve the learning of a hidden Markov model through our method of automatically generalizing the annotations, reducing noise, and inducing word-tag frequency information.
