Modeling the sample distribution for clustering OCR

This paper re-examines a well-known OCR technique, recognition by clustering followed by cryptanalysis, from a Bayesian perspective. Such techniques have the advantage of being font-independent, but in the past they appear not to have performed competitively with other pattern recognition techniques. The analysis presented here suggests an approach to OCR based on modeling the sample distribution as a mixture of Gaussians. Results suggest that such an approach may combine the advantages of cluster-based OCR with the performance of traditional classification algorithms.
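To make the idea of modeling a sample distribution as a mixture of Gaussians concrete, here is a minimal sketch that fits a one-dimensional mixture with plain expectation-maximization. It is an illustration under stated assumptions, not the method of the paper: the function names (`fit_gmm_1d`, `assign`), the quantile initialization, and the synthetic scalar "feature" values are all hypothetical, and real character samples would be high-dimensional.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, k, iterations=100, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture to samples with plain EM."""
    ordered = sorted(samples)
    n = len(ordered)
    # Assumed initialization: means at evenly spaced quantiles,
    # a shared overall variance, and equal mixing weights.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(samples) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in samples) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in samples:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            resp.append([p / total for p in joint])

        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2
                        for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_j, min_var)

    return weights, means, variances


def assign(x, weights, means, variances):
    """Index of the component with the highest posterior probability for x."""
    scores = [w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)


# Synthetic stand-in for two character clusters (hypothetical data).
rng = random.Random(0)
samples = ([rng.gauss(0.0, 1.0) for _ in range(200)]
           + [rng.gauss(10.0, 1.0) for _ in range(200)])
weights, means, variances = fit_gmm_1d(samples, k=2)
```

In a cluster-based recognizer, each fitted component would play the role of one cluster of similar glyphs; mapping clusters to character identities (the cryptanalysis step) is a separate problem that this sketch does not address.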
