High-performance OCR preclassification trees

We present an automatic method for constructing high-performance preclassification decision trees for OCR. A good preclassifier prunes the set of candidate classes to a much smaller one without erroneously pruning the correct class. We build the decision tree by greedy entropy minimization on pseudo-randomly generated training samples derived from a model of imaging defects, and then 'populate' the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error; it works in practice, but it is too conservative, driving the error far below the bound. Here we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
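Good's Theorem (the Good-Turing estimate) says that if a leaf has so far received N training samples, of which n1 belong to classes seen there exactly once, then n1/N estimates the probability that the next sample reaching that leaf will belong to a class the leaf does not yet retain. A leaf-selection rule in this spirit might direct further population toward the leaves with the largest estimated unseen-class mass and stop once every leaf falls below the user-specified error bound. The Python sketch below is only an illustration of that idea under those assumptions; the function names, the per-leaf stopping test, and the data layout are ours, not the paper's exact formulation.

```python
from collections import Counter

def unseen_class_mass(leaf_labels):
    """Good-Turing estimate of the chance that the next sample reaching
    this leaf belongs to a class not yet observed there: the number of
    classes seen exactly once at the leaf, divided by the leaf's sample count."""
    if not leaf_labels:
        return 1.0                       # an empty leaf has seen nothing yet
    counts = Counter(leaf_labels)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(leaf_labels)

def next_leaf_to_populate(leaves, error_bound):
    """Hypothetical leaf-selection rule: return the index of the leaf with
    the largest estimated unseen-class mass, or None if every leaf is
    already below the user-specified bound (i.e. stop populating)."""
    masses = [unseen_class_mass(labels) for labels in leaves]
    worst = max(range(len(leaves)), key=masses.__getitem__)
    return worst if masses[worst] > error_bound else None

# Example: three leaves with the class labels of the samples routed to them.
leaves = [["e", "e", "c"], ["O", "0", "Q", "O"], ["i", "i", "i", "i"]]
print(next_leaf_to_populate(leaves, error_bound=0.1))   # -> 1 (mass 2/4 = 0.5)
```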
