Probabilistic Retrieval Methods for Text with Misrecognized OCR Characters

This paper presents two probabilistic text retrieval methods specifically designed for full-text search of Japanese documents containing OCR errors. Because every query term is searched under the premise that the recognized text contains errors, the presented methods tolerate such errors, and manual post-editing after OCR is therefore not required. In this approach, confusion matrices store (i) the characters with which a particular character is likely to be confused when it is misrecognized, and (ii) the probability of each such occurrence. Multiple search terms are generated for an input query term by referencing these matrices, and a full-text search is then performed for each search term. The validity of each retrieved term is determined from the error-occurrence probabilities, and terms whose validity exceeds a given threshold are judged to satisfy the input query. Method performance is evaluated experimentally in terms of retrieval effectiveness, i.e., recall and precision rates. Results indicate marked improvement in comparison with exact matching.
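The general idea of confusion-matrix query expansion can be illustrated with a minimal sketch. It is not the paper's implementation: the matrix entries and probabilities below are invented for illustration, the function names (`candidates`, `expand_query`, `search`) are hypothetical, and the validity of a search term is assumed here to be the product of the per-character probabilities, which the abstract does not specify.

```python
from itertools import product

# Hypothetical confusion matrix: for each true character, the characters OCR
# may output in its place and an invented probability for each outcome.
CONFUSION = {
    "大": {"大": 0.90, "犬": 0.06, "太": 0.04},
    "学": {"学": 0.95, "字": 0.05},
}


def candidates(ch):
    """Return the possible recognized forms of ch with their probabilities.

    Characters missing from the matrix are assumed to be recognized correctly.
    """
    return CONFUSION.get(ch, {ch: 1.0})


def expand_query(term):
    """Generate every search term for a query term, with its validity.

    Assumption: validity is the product of the per-character probabilities.
    """
    options = [candidates(ch).items() for ch in term]
    expanded = {}
    for combo in product(*options):
        variant = "".join(ch for ch, _ in combo)
        validity = 1.0
        for _, prob in combo:
            validity *= prob
        expanded[variant] = validity
    return expanded


def search(text, term, threshold):
    """Full-text search for each search term whose validity exceeds threshold.

    Returns a list of (position, matched_term, validity), sorted by position.
    """
    hits = []
    for variant, validity in expand_query(term).items():
        if validity <= threshold:
            continue  # too unlikely to be a misrecognition of the query term
        start = text.find(variant)
        while start != -1:
            hits.append((start, variant, validity))
            start = text.find(variant, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # The first occurrence of the query term has been misrecognized as 犬学.
    recognized_text = "東京犬学と京都大学"
    for position, matched, validity in search(recognized_text, "大学", 0.05):
        print(position, matched, round(validity, 3))
```

With these invented values, exact matching of 大学 would find only the occurrence at position 7, whereas the expanded search also accepts the misrecognized 犬学 at position 2 because its validity exceeds the threshold. Raising the threshold trades recall for precision, which is the trade-off the evaluation measures.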