A Multi-Domain Web-Based Algorithm for POS Tagging of Unknown Words

We present a web-based algorithm for the task of POS tagging of unknown words (words appearing only a small number of times in the training data of a supervised POS tagger). When a sentence s containing an unknown word u is to be tagged by a trained POS tagger, our algorithm collects from the web contexts that are partially similar to the context of u in s, which are then used to compute new tag assignment probabilities for u. Our algorithm enables fast multi-domain unknown word tagging, since, unlike previous work, it does not require a corpus from the new domain. We integrate our algorithm into the MXPOST POS tagger (Ratnaparkhi, 1996) and experiment with three languages (English, German and Chinese) in seven in-domain and domain adaptation scenarios. Our algorithm provides an error reduction of up to 15.63% (English), 18.09% (German) and 13.57% (Chinese) over the original tagger.
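The last step described above, turning collected contexts into new tag assignment probabilities for u, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the web contexts have already been retrieved and tokenized, that a tagger exposes a per-token tag distribution, and that the new probabilities are a simple average over the occurrences of u. The abstract does not specify the combination rule, and the names `retag_unknown_word` and `toy_tagger` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

TagDist = Dict[str, float]
# Hypothetical tagger interface: (tokens, index of target word) -> tag distribution.
Tagger = Callable[[Sequence[str], int], TagDist]


def retag_unknown_word(
    web_contexts: List[Sequence[str]], unknown_word: str, tagger: Tagger
) -> TagDist:
    """Average the tagger's tag distributions for unknown_word over all contexts.

    Assumption: plain averaging stands in for whatever combination rule
    the paper actually uses.
    """
    totals: Dict[str, float] = defaultdict(float)
    occurrences = 0
    for tokens in web_contexts:
        for i, token in enumerate(tokens):
            if token == unknown_word:
                for tag, prob in tagger(tokens, i).items():
                    totals[tag] += prob
                occurrences += 1
    if occurrences == 0:
        return {}  # no evidence collected; the caller keeps the original tagging
    return {tag: total / occurrences for tag, total in totals.items()}


def toy_tagger(tokens: Sequence[str], i: int) -> TagDist:
    """Stand-in for a trained tagger: looks only at the previous word."""
    previous = tokens[i - 1] if i > 0 else ""
    if previous == "the":
        return {"NN": 0.9, "VB": 0.1}
    if previous == "to":
        return {"NN": 0.2, "VB": 0.8}
    return {"NN": 0.5, "VB": 0.5}


# Hypothetical contexts standing in for snippets collected from the web.
contexts = [
    ["the", "blorf", "ran"],
    ["to", "blorf", "is", "fun"],
    ["the", "old", "blorf"],
]
new_probs = retag_unknown_word(contexts, "blorf", toy_tagger)
print(new_probs)
```

In a real setting the toy tagger would be replaced by the trained tagger's own probability estimates, and the contexts by the results of web queries built from u and parts of its sentence context.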

Ari Rappoport | Roi Reichart | Shulamit Umansky-Pesin

[1] John Blitzer, et al. Domain Adaptation with Structural Correspondence Learning, 2006, EMNLP.

[2] Jeff A. Bilmes, et al. Part-of-Speech Tagging using Virtual Evidence and Negative Training, 2005, HLT.

[3] Ari Rappoport, et al. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining, 2007, ACL.

[4] Likun Qiu, et al. A Method for Automatic POS Guessing of Chinese Unknown Words, 2008, COLING.

[5] Christopher D. Manning, et al. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, 2000, EMNLP.

[6] Yuji Matsumoto, et al. Guessing Parts-of-Speech of Unknown Words Using Global Information, 2006, ACL.

[7] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[8] Ido Dagan, et al. Scaling Web-based Acquisition of Entailment Relations, 2004, EMNLP.

[9] Nianwen Xue, et al. Building a Large-Scale Annotated Chinese Corpus, 2002, COLING.

[10] Frank Keller, et al. Using the Web to Obtain Frequencies for Unseen Bigrams, 2003, CL.

[11] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[12] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[13] Sabine Brants, et al. The TIGER Treebank, 2001.

[14] Michael Collins, et al. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, 2002, EMNLP.

[15] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[16] Hal Daumé, et al. Frustratingly Easy Domain Adaptation, 2007, ACL.

[17] Daniel Jurafsky, et al. Morphological features help POS tagging of unknown words across language varieties, 2005, IJCNLP.

[18] Matthew Lease, et al. Parsing Biomedical Literature, 2005, IJCNLP.

[19] Eugene Charniak, et al. Effective Self-Training for Parsing, 2006, NAACL.

[20] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[21] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[22] Patrick Pantel, et al. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations, 2004, EMNLP.

[23] Jun'ichi Tsujii, et al. GENIA corpus - a semantically annotated corpus for bio-textmining, 2003, ISMB.

[24] James R. Curran, et al. Bootstrapping POS-taggers using unlabelled data, 2003, CoNLL.

[25] Ming Zhou, et al. Improving Query Spelling Correction Using Web Search Results, 2007, EMNLP-CoNLL.

[26] James R. Curran, et al. Tagging Unknown Words with Raw Text Features, 2005, ALTA.

[27] Ari Rappoport, et al. Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets, 2007, ACL.

[28] Frank Keller, et al. Using the Web to Overcome Data Sparseness, 2002, EMNLP.