Paper information: Impact of imperfect OCR on part-of-speech tagging

Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years. However, one question remains unanswered: How will a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources like optical character recognition (OCR). This paper analyzes the performance of both individual POS taggers and combination systems on imperfect text. Experimental results show that a POS tagger's accuracy decreases linearly with the character error rate and the slope indicates a tagger's sensitivity to input text errors.
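The linear relationship described in the abstract can be illustrated with a short least-squares fit. This is only a sketch under stated assumptions: the function name `fit_line`, the sample character error rates, and the accuracy values are hypothetical and synthetic (generated from an arbitrary straight line), not results from the paper, and the paper's own fitting procedure is not given here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, illustrative values only (NOT measurements from the paper):
# character error rates and tagger accuracies lying on an arbitrary line.
cer = [0.0, 0.01, 0.02, 0.05, 0.10]
accuracy = [0.96 - 1.5 * e for e in cer]

slope, intercept = fit_line(cer, accuracy)

# The slope is negative when accuracy falls as errors increase, so its
# magnitude can be read as the tagger's sensitivity to input text errors.
sensitivity = -slope
print(f"slope={slope:.3f}, intercept={intercept:.3f}, sensitivity={sensitivity:.3f}")
```

Under this reading, a tagger with a larger `sensitivity` loses more accuracy per unit of character error rate, and the intercept approximates its accuracy on error-free text.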

Xiaofan Lin
