The extraction and recognition of text from multimedia document images

Almost all current commercial OCR machines employ matrix matching, which gives high speed and accuracy but severely restricts the range of recognized fonts. Published algorithms, conversely, concentrate on feature extraction for font independence, yet they have previously been too slow for commercial use. Current algorithms also fail to distinguish between text and non-text images. This thesis presents a new approach to the automatic extraction of text from multimedia printed documents. An edge detection algorithm, capable of extracting the outlines of text from a grey-level image, is used to obtain a high level of discrimination between text and non-text. An additional benefit is that text of any colour can be read from almost any background, provided that the contrast is reasonable. The outlines are approximated by polygons using a fast two-stage algorithm. A feature extraction approach to font-independent character recognition is described, which uses these outline polygons. It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features. The results show that, after training on only seven quite similar fonts, the recognition algorithm achieves greater than 95% accuracy on fonts that differ from those in the training set. A more complex edge extraction algorithm is also described, which is capable of extracting text and line graphics from an arbitrary page. Although not essential for character recognition, this algorithm is useful for the interpretation of engineering drawings. As a further contribution to this problem, a non-iterative thinning algorithm is defined, which uses the polygonally approximated outlines from the edge extractor.
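As an illustration of what approximating an outline by a polygon involves, the following is a minimal sketch only. The abstract does not specify the two-stage algorithm, so the sketch substitutes the well-known Ramer-Douglas-Peucker method as a stand-in; the function names and the `tolerance` parameter are assumptions made for this example and are not taken from the thesis.

```python
import math


def point_segment_distance(p, a, b):
    """Return the distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_polyline(points, tolerance):
    """Ramer-Douglas-Peucker stand-in (hypothetical helper, not the thesis's
    two-stage algorithm): keep only the vertices needed so that every
    original point lies within `tolerance` of the simplified polyline."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_distance = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst_distance:
            worst_index, worst_distance = i, d
    if worst_distance <= tolerance:
        # Every interior point is close enough to the chord first-last.
        return [first, last]
    # Split at the farthest point and simplify each half recursively.
    left = approximate_polyline(points[:worst_index + 1], tolerance)
    right = approximate_polyline(points[worst_index:], tolerance)
    return left[:-1] + right


# An L-shaped run of outline pixels collapses to its three corner points.
outline = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
corners = approximate_polyline(outline, tolerance=0.5)
```

A larger tolerance yields fewer vertices at the cost of a coarser fit. Features for recognition would then be computed from the reduced vertex list rather than from individual pixels.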
