Segmentation Methods for Recognition of Machine-Printed Characters

This paper reports an investigation of some methods for isolating, or segmenting, characters during the reading of machine-printed text by optical character recognition systems. Two new segmentation algorithms using feature extraction techniques are presented; both are intended for use in the recognition of machine-printed lines of 10-, 11- and 12-pitch serif-type multifont characters. One of the methods, called quasi-topological segmentation, bases the decision to “section” a character on a combination of feature-extraction and character-width measurements. The other method, topological segmentation, involves feature extraction alone. The algorithms have been tested with an evaluation method that is independent of any particular recognition system. Test results are based on application of the algorithm to upper-case alphanumeric characters gathered from print sources that represent the existing world of machine printing. The topological approach demonstrated better performance on the test data than did the quasi-topological approach.