Post-processing of OCR results for automatic indexing

Indexing inaccurately recognized OCR text yields unsatisfactory results: the quality of the index terms decreases rapidly as the quality of the documents gets worse. Index terms of OCR-processed documents can be used for archiving or classification tasks. We present an indexing component whose input consists of character hypothesis lattices. These lattices are post-processed by a generate-and-test component, which feeds word candidates to a morphology component, a rule-based substitution system, and a trigram correction component. Stop words are filtered by a Levenshtein-based elimination routine. The recognized words are subsequently processed by our indexing component. Our system minimizes the number of generated index terms which are correct German words. The experiments have shown an increase in accuracy of close to 10%.
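As an illustration of the Levenshtein-based stop-word elimination mentioned above, the following is a minimal sketch, not the routine used in the system. The stop-word list, the distance threshold `max_distance`, the length guard `min_length`, and all function names are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical, tiny German stop-word list; a real list would be far larger.
STOP_WORDS = {"der", "die", "das", "und", "nicht", "eine"}


def is_probable_stop_word(token, stop_words=STOP_WORDS,
                          max_distance=1, min_length=4):
    """Treat a token as a stop word if it matches one exactly or nearly.

    Assumption: tokens shorter than min_length are matched exactly only,
    since one edit changes too much of a very short word.
    """
    t = token.lower()
    if t in stop_words:
        return True
    if len(t) < min_length:
        return False
    return any(levenshtein(t, s) <= max_distance for s in stop_words)


def filter_stop_words(tokens):
    """Drop tokens that look like (possibly misrecognized) stop words."""
    return [t for t in tokens if not is_probable_stop_word(t)]
```

With these assumed settings, a token such as "nlcht" (a plausible misreading of "nicht") is eliminated, while content words such as "Rechnung" pass through to indexing. The threshold trades recall of damaged stop words against the risk of discarding short content words.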
