Paper information - Improved string matching under noisy channel conditions

Many document-based applications, including popular Web browsers, email viewers, and word processors, have a 'Find on this Page' feature that allows a user to find every occurrence of a given string in the document. If the document text being searched is derived from a noisy process such as optical character recognition (OCR), the effectiveness of typical string matching can be greatly reduced. This paper describes an enhanced string-matching algorithm for degraded text that improves recall while keeping precision at acceptable levels. The algorithm is more general than most approximate matching algorithms and allows string-to-string edits with arbitrary costs. We develop a method for evaluating our technique and use it to examine the relative effectiveness of each sub-component of the algorithm. Of the components we varied, we find that using confidence information from the recognition process leads to the largest improvements in matching accuracy.
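To make the idea of "string-to-string edits with arbitrary costs" concrete, here is a minimal sketch of approximate substring search with a generalized edit table. It is not the paper's algorithm: the function name `find_matches`, the edit table format, the default unit costs, and the example costs of 0.2 are all assumptions chosen for illustration. The table maps a (query substring, observed text substring) pair to a cost, so that a typical OCR confusion such as "m" read as "rn" can be made cheaper than two generic character edits.

```python
from typing import Dict, List, Tuple

# Hypothetical edit table type: (substring of query, substring of noisy text) -> cost
EditTable = Dict[Tuple[str, str], float]


def find_matches(
    pattern: str,
    text: str,
    edits: EditTable,
    max_cost: float,
    sub_cost: float = 1.0,  # assumed default costs, not taken from the paper
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (end_offset, cost) for every text position where `pattern`
    matches some substring ending there with total cost <= max_cost."""
    m, n = len(pattern), len(text)
    inf = float("inf")
    # D[i][j]: cheapest way to turn pattern[:i] into a substring of text ending at j.
    # Row 0 is all zeros so that a match may start at any text position.
    D = [[0.0] * (n + 1)] + [[inf] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
        for j in range(1, n + 1):
            same = pattern[i - 1] == text[j - 1]
            best = min(
                D[i - 1][j - 1] + (0.0 if same else sub_cost),  # match / substitute
                D[i - 1][j] + del_cost,  # query character missing from text
                D[i][j - 1] + ins_cost,  # spurious character in text
            )
            # Multi-character edits: both sides must be non-empty in this sketch.
            for (src, dst), cost in edits.items():
                if src and dst and pattern.endswith(src, 0, i) and text.endswith(dst, 0, j):
                    best = min(best, D[i - len(src)][j - len(dst)] + cost)
            D[i][j] = best
    return [(j, D[m][j]) for j in range(1, n + 1) if D[m][j] <= max_cost]


if __name__ == "__main__":
    # Illustrative OCR confusions with made-up costs.
    ocr_edits = {("m", "rn"): 0.2, ("rn", "m"): 0.2}
    print(find_matches("modern", "the rnodem era", ocr_edits, max_cost=0.5))
```

With the illustrative table, the query "modern" matches the degraded word "rnodem" at low cost, whereas the same search with an empty table finds nothing within the threshold. One way to bring in recognizer confidence, again only as an assumption, would be to lower the cost of an edit wherever the recognizer reported low confidence in the observed characters.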

Susan T. Dumais | Kevyn Collins-Thompson | Charles Schweizer
