Paper information - Developing a Robust Part-of-Speech Tagger for Biomedical Text

Developing a Robust Part-of-Speech Tagger for Biomedical Text

This paper presents a part-of-speech tagger which is specifically tuned for biomedical text. We have built the tagger with maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus revealed that adding training data from a different domain does not hurt the performance of a tagger, and our tagger exhibits very good precision (97% to 98%) on all these corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.
