Using artificial neural networks to identify headings in newspaper documents

This paper tests several features for neural-network-based document region identification, focusing on features for identifying headline and subheadline regions. The region identification algorithm is a key component of a document recognition system that segments a document into regions, classifies them as text, graphic, photo, or other region types, and then uses this classification to guide the processing and analysis of the image. The input data are unusually challenging: low-quality images of newspaper documents obtained from microfilmed archives. Experiments on several newspaper documents show that the features used are capable of robust and accurate headline identification.
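The flow described above (segment a page into regions, compute features per region, pass them to a neural network) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the features or the network architecture, so the three features (relative height, ink density, aspect ratio), the label set, the `Region` type, and the small untrained network are all hypothetical stand-ins rather than the paper's method.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List

# Assumed label set for illustration; the paper's actual classes may differ.
LABELS = ["headline", "subheadline", "other"]


@dataclass
class Region:
    """Axis-aligned bounding box of a segmented region (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def region_features(image: List[List[int]], region: Region,
                    body_line_height: float) -> List[float]:
    """Compute three hypothetical features from a binary image (1 = ink)."""
    rows = image[region.y:region.y + region.height]
    ink = sum(sum(row[region.x:region.x + region.width]) for row in rows)
    area = region.width * region.height
    return [
        region.height / body_line_height,  # height relative to body text
        ink / area,                        # fraction of inked pixels
        region.width / region.height,      # aspect ratio
    ]


def make_network(n_in: int, n_hidden: int, n_out: int,
                 seed: int = 0) -> Dict[str, list]:
    """Build a one-hidden-layer network with random, UNTRAINED weights."""
    rng = random.Random(seed)

    def layer(rows: int, cols: int) -> List[List[float]]:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {"w1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": layer(n_out, n_hidden), "b2": [0.0] * n_out}


def forward(net: Dict[str, list], features: List[float]) -> List[float]:
    """Forward pass: tanh hidden layer, then softmax over the labels."""
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(net["w2"], net["b2"])]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 4x4 binary "page" with one region covering the top two rows.
image = [[1, 1, 0, 1],
         [1, 0, 1, 1],
         [0, 0, 0, 0],
         [0, 1, 0, 0]]
region = Region(x=0, y=0, width=4, height=2)
features = region_features(image, region, body_line_height=1.0)
probs = forward(make_network(n_in=3, n_hidden=4, n_out=len(LABELS)), features)
```

Because the weights are random, `probs` only demonstrates the data flow from region to class probabilities. A usable classifier would need training on labeled regions, and features robust to the microfilm noise the abstract mentions.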
