A Multi-Domain Web-Based Algorithm for POS Tagging of Unknown Words

We present a web-based algorithm for POS tagging of unknown words, that is, words that appear only a small number of times in the training data of a supervised POS tagger. When a trained POS tagger is to tag a sentence s containing an unknown word u, our algorithm collects web contexts that are partially similar to the context of u in s and uses them to compute new tag assignment probabilities for u. Because it does not require a corpus from the new domain, unlike previous work, the algorithm enables fast multi-domain unknown word tagging. We integrate the algorithm into the MXPOST POS tagger (Ratnaparkhi, 1996) and experiment with three languages (English, German and Chinese) in seven in-domain and domain adaptation scenarios. The algorithm provides an error reduction of up to 15.63% (English), 18.09% (German) and 13.57% (Chinese) over the original tagger.
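The general idea of collecting partially similar contexts and turning them into tag probabilities can be illustrated with a minimal Python sketch. This is not the procedure from the paper, whose details the abstract does not give. The function names `partial_context_queries` and `tag_distribution`, the `window` parameter, and the use of plain relative frequencies are all assumptions made for illustration, and the web retrieval and tagging of the retrieved contexts are assumed to have happened elsewhere.

```python
from collections import Counter


def partial_context_queries(tokens, idx, window=2):
    """Build query strings that keep the unknown word plus part of its context.

    Hypothetical helper: each query contains the word at `idx` together with
    between 0 and `window` neighbours on each side (but never the bare word).
    """
    word = tokens[idx]
    left = tokens[max(0, idx - window):idx]
    right = tokens[idx + 1:idx + 1 + window]
    queries = []
    for n_left in range(len(left) + 1):
        for n_right in range(len(right) + 1):
            if n_left == 0 and n_right == 0:
                continue  # the word alone carries no context
            kept_left = left[len(left) - n_left:]
            kept_right = right[:n_right]
            queries.append(" ".join(kept_left + [word] + kept_right))
    return queries


def tag_distribution(word, tagged_contexts):
    """Relative frequency of the tags assigned to `word` in tagged contexts.

    `tagged_contexts` is a list of sentences, each a list of (token, tag)
    pairs. Assumption: these come from retrieved web text that some tagger
    has already labelled; relative frequency is only a stand-in estimate.
    """
    counts = Counter(
        tag
        for sentence in tagged_contexts
        for token, tag in sentence
        if token == word
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}
```

In this sketch, the queries for a word would be sent to a search engine, the returned snippets tagged, and the resulting distribution combined with the base tagger's own probabilities for u; how that combination is done is left open here.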

[1] John Blitzer, et al. Domain Adaptation with Structural Correspondence Learning, 2006, EMNLP.

[2] Jeff A. Bilmes, et al. Part-of-Speech Tagging using Virtual Evidence and Negative Training, 2005, HLT.

[3] Ari Rappoport, et al. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining, 2007, ACL.

[4] Likun Qiu, et al. A Method for Automatic POS Guessing of Chinese Unknown Words, 2008, COLING.

[5] Christopher D. Manning, et al. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, 2000, EMNLP.

[6] Yuji Matsumoto, et al. Guessing Parts-of-Speech of Unknown Words Using Global Information, 2006, ACL.

[7] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[8] Ido Dagan, et al. Scaling Web-based Acquisition of Entailment Relations, 2004, EMNLP.

[9] Nianwen Xue, et al. Building a Large-Scale Annotated Chinese Corpus, 2002, COLING.

[10] Frank Keller, et al. Using the Web to Obtain Frequencies for Unseen Bigrams, 2003, CL.

[11] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[12] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[13] Sabine Brants, et al. The TIGER Treebank, 2001.

[14] Michael Collins, et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[15] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[16] Hal Daumé, et al. Frustratingly Easy Domain Adaptation, 2007, ACL.

[17] Daniel Jurafsky, et al. Morphological features help POS tagging of unknown words across language varieties, 2005, IJCNLP.

[18] Matthew Lease, et al. Parsing Biomedical Literature, 2005, IJCNLP.

[19] Eugene Charniak, et al. Effective Self-Training for Parsing, 2006, NAACL.

[20] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[21] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[22] Patrick Pantel, et al. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations, 2004, EMNLP.

[23] Jun'ichi Tsujii, et al. GENIA corpus - a semantically annotated corpus for bio-textmining, 2003, ISMB.

[24] James R. Curran, et al. Bootstrapping POS-taggers using unlabelled data, 2003, CoNLL.

[25] Ming Zhou, et al. Improving Query Spelling Correction Using Web Search Results, 2007, EMNLP-CoNLL.

[26] James R. Curran, et al. Tagging Unknown Words with Raw Text Features, 2005, ALTA.

[27] Ari Rappoport, et al. Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets, 2007, ACL.

[28] Frank Keller, et al. Using the Web to Overcome Data Sparseness, 2002, EMNLP.