Automatic error correction and query evaluation of OCR-generated text

Our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database, rather than the entire dictionary; 2) local information obtained from the individual documents; and 3) a confusion matrix, which captures the kinds of errors characteristic of the particular OCR device. The system is used to process a database of approximately 9300 pages of OCR-generated documents.
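
As an illustration only, the three principles might be combined as in the following minimal Python sketch. The abstract does not say how the system actually combines them, so everything here is an assumption: the names CONFUSION_COST, weighted_distance and correct are hypothetical, the confusion costs are made-up values rather than measurements from any OCR device, and using per-document term counts as a tie-breaker is just one plausible reading of "local information".

```python
# Hypothetical confusion costs: (recognized char, intended char) -> cost.
# A real table would be estimated from the errors of a specific OCR device;
# these values are invented for illustration.
CONFUSION_COST = {("1", "l"): 0.2, ("0", "o"): 0.2, ("c", "e"): 0.4}


def sub_cost(recognized, intended):
    """Substitution cost: cheap for confusions the table lists, 1.0 otherwise."""
    if recognized == intended:
        return 0.0
    return CONFUSION_COST.get((recognized, intended), 1.0)


def weighted_distance(observed, candidate):
    """Edit distance with confusion-weighted substitutions.

    Insertions and deletions cost 1.0 each (an assumption of this sketch).
    """
    n = len(candidate)
    prev = [float(j) for j in range(n + 1)]
    for i in range(1, len(observed) + 1):
        cur = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1.0,      # drop a character of the observed word
                cur[j - 1] + 1.0,   # add a character of the candidate
                prev[j - 1] + sub_cost(observed[i - 1], candidate[j - 1]),
            )
        prev = cur
    return prev[n]


def correct(word, collection_terms, doc_counts, max_distance=1.0):
    """Pick the closest term from the collection vocabulary.

    Candidates come from terms that occur in the database, not from a full
    dictionary. Ties in distance are broken by how often the candidate
    appears in the same document (doc_counts), then alphabetically.
    The word is returned unchanged if no term is within max_distance.
    """
    best_key, best_term = None, word
    for term in collection_terms:
        d = weighted_distance(word, term)
        if d > max_distance:
            continue
        key = (d, -doc_counts.get(term, 0), term)
        if best_key is None or key < best_key:
            best_key, best_term = key, term
    return best_term
```

For example, with a collection vocabulary of {"retrieval", "control"}, the call correct("retrieva1", {"retrieval", "control"}, {}) selects "retrieval", because the only difference is the substitution of "1" for "l", which this table treats as a cheap, device-typical confusion.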