Prediction of OCR Accuracy

The accuracy of all contemporary OCR technologies varies drastically as a function of input image quality [Rice 92, Rice 93, Chen 93, Rice 94]. Given high-quality images, many devices consistently deliver output text that is more than 99% correct. For low-quality images, even ones that a human reads easily, output accuracy is frequently below 90%. This extreme sensitivity to quality is well known in the document analysis field and is the subject of much current research.

In this ongoing project, we have been interested in developing measures of image quality. We are especially interested in learning to predict OCR accuracy from some combination of image quality measures that are independent of the OCR devices themselves.

Reliable algorithms for measuring print quality and predicting OCR accuracy would be valuable in several ways. First, in large-scale document conversion operations, they could be used to automatically filter out pages that are more economically recovered via manual entry. Second, they might be employed iteratively as part of an adaptive image enhancement system. At the same time, studies into the nature of image quality can contribute to our overall understanding of the effect of noise on classification algorithms.

In this paper, we propose a prediction technique based on measuring features associated with degraded characters. To limit the scope of the research, the following assumptions are made: