Probabilistic Automaton Model for Fuzzy English-Text Retrieval

Optical character reader (OCR) misrecognition is a serious problem when searching OCR-scanned documents in databases such as digital libraries. This paper proposes fuzzy retrieval methods for English text containing recognition errors; because the errors do not have to be corrected manually, costs are reduced. The proposed methods generate multiple search terms for each input query term using probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. Experimental results on a test set indicate that, with 20 expanded search terms, one of the proposed methods improves the recall rate from 95.56% to 97.88%, at the cost of a decrease in the precision rate from 100.00% to 95.52%.
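
The idea of expanding a query term into several likely OCR-corrupted variants can be illustrated with a minimal Python sketch. It is not the paper's automaton construction: the function name `expand`, the tables `CONFUSION` and `BIGRAM`, the fallback `DEFAULT_BIGRAM`, and every probability value are assumptions made up for illustration. The sketch scores each candidate by combining a per-character error probability with a character-to-character connection probability, and keeps the best candidates with a simple beam search.

```python
import heapq
import math

# Hypothetical error-occurrence table: P(OCR output | true character).
# All values are illustrative and are NOT taken from the paper.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.07, "I": 0.03},
    "o": {"o": 0.95, "0": 0.05},
    "m": {"m": 0.92, "rn": 0.08},
}

# Hypothetical character-connection table: P(next char | previous char).
# Pairs that are not listed fall back to DEFAULT_BIGRAM (also an assumption).
BIGRAM = {("m", "o"): 0.10, ("o", "d"): 0.08, ("r", "n"): 0.02}
DEFAULT_BIGRAM = 0.05


def expand(term, k=20, beam=50):
    """Return up to k (variant, log_probability) pairs, most probable first."""
    beams = [(0.0, "")]  # each entry is (log probability, variant so far)
    for ch in term:
        # Characters without an entry are assumed to be recognized correctly.
        outputs = CONFUSION.get(ch, {ch: 1.0})
        candidates = []
        for logp, prefix in beams:
            for out, p_err in outputs.items():
                score = logp + math.log(p_err)
                prev = prefix[-1] if prefix else "^"  # "^" marks the term start
                for c in out:
                    score += math.log(BIGRAM.get((prev, c), DEFAULT_BIGRAM))
                    prev = c
                candidates.append((score, prefix + out))
        # Keep only the highest-scoring partial variants.
        beams = heapq.nlargest(beam, candidates)
    return [(variant, score) for score, variant in heapq.nlargest(k, beams)]
```

Calling `expand("model", k=20)` returns the original spelling together with variants such as `m0del` and `mode1`, each of which could be submitted as an additional search term. The limit `k` plays the role of the number of expanded search terms, so raising it trades precision for recall in the way the abstract describes.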
