Text identification for document image analysis using a neural network

Abstract: A new bottom-up method is described that clusters the content of a mixed-type document into text and non-text areas. The approach combines a new set of features with a self-organized neural network classifier. The features describe the contents of 3×3 masks and the relationships between them, are selected by a statistical reduction procedure, and provide texture information. A Principal Components Analyzer (PCA) is then applied to reduce them to a smaller number of "effective" features. The final feature set is used as the input vector to a neural network classifier based on a Kohonen Self-Organized Feature Map (SOFM). Document blocks are classified as text, graphics, or halftones, or into secondary subclasses corresponding to special cases of these primary classes. The method can identify text regions embedded in graphics, and even overlapping regions, that is, regions that cannot be separated with horizontal and vertical cuts. The method was tested extensively on a variety of documents, with very promising results.
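The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature used here (a normalized histogram of all 512 binary 3×3 patterns) is only an assumed stand-in for the paper's mask-based features, the statistical reduction and PCA steps are omitted for brevity, and the map is a small one-dimensional Kohonen network. All function names, the synthetic `stripes` and `checkerboard` blocks, and the training parameters are hypothetical choices made for illustration.

```python
import math
import random


def mask_histogram(block):
    """Normalized histogram of the 512 possible binary 3x3 patterns in a block.

    `block` is a list of rows of 0/1 pixels. This is an assumed texture
    feature, not the feature set defined in the paper.
    """
    rows, cols = len(block), len(block[0])
    hist = [0.0] * 512
    count = 0
    for r in range(rows - 2):
        for c in range(cols - 2):
            code = 0
            # Pack the nine pixels of the window into a 9-bit code, row by row.
            for dr in range(3):
                for dc in range(3):
                    code = (code << 1) | block[r + dr][c + dc]
            hist[code] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], x))


def train_sofm(samples, n_units=4, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a 1-D Kohonen map and return the list of unit weight vectors."""
    rng = random.Random(seed)
    # Initialize each unit with a copy of a randomly chosen training sample.
    weights = [list(rng.choice(samples)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighborhood
        for x in rng.sample(samples, len(samples)):
            win = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighborhood on the 1-D map around the winner.
                h = math.exp(-((i - win) ** 2) / (2.0 * radius ** 2))
                for j in range(len(w)):
                    w[j] += lr * h * (x[j] - w[j])
    return weights


def stripes(rows, cols):
    """Synthetic 'text-line-like' block: two rows on, two rows off."""
    return [[1 if (r // 2) % 2 == 0 else 0 for _ in range(cols)]
            for r in range(rows)]


def checkerboard(rows, cols):
    """Synthetic 'halftone-like' block: alternating pixels."""
    return [[(r + c) % 2 for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    blocks = [stripes(8, 8), stripes(10, 12),
              checkerboard(8, 8), checkerboard(10, 12)]
    features = [mask_histogram(b) for b in blocks]
    som = train_sofm(features)
    for name, f in zip(["stripes", "stripes", "checker", "checker"], features):
        print(name, "-> unit", best_matching_unit(som, f))
```

After training, each map unit would still have to be labeled (text, graphics, halftone, or a subclass) from blocks of known type. That labeling step is left out here because the abstract does not describe how it is done.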
