Document structure analysis algorithms: a literature survey

Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and modeled by a tree grammar that describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
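As an illustration of the ordered-tree view described above, the following is a minimal sketch and not a method taken from any of the surveyed papers. The names `Region`, `contains`, `is_valid`, and `reading_order`, the box convention, and the sample page coordinates are all assumptions made for this example. Containment is modeled by parent/child links with nested bounding boxes, and order by the sequence of children.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box convention: (left, top, right, bottom) in page units.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    """A node of the ordered tree: one physical or logical page component."""
    label: str
    box: Box
    children: List["Region"] = field(default_factory=list)  # order is meaningful


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def is_valid(node: Region) -> bool:
    """Check the containment relation: every child box lies inside its parent."""
    return all(contains(node.box, c.box) and is_valid(c) for c in node.children)


def reading_order(node: Region) -> List[str]:
    """Leaf labels in left-to-right tree order, i.e. the order relation."""
    if not node.children:
        return [node.label]
    labels: List[str] = []
    for child in node.children:
        labels.extend(reading_order(child))
    return labels


# Made-up two-column page used only to exercise the functions above.
page = Region("page", (0, 0, 100, 140), [
    Region("title", (10, 5, 90, 15)),
    Region("body", (10, 20, 90, 130), [
        Region("column-1", (10, 20, 48, 130)),
        Region("column-2", (52, 20, 90, 130)),
    ]),
])

print(is_valid(page))
print(reading_order(page))
```

A tree grammar, in this picture, would constrain which labels may appear as children of which parents and in what order; the sketch only represents the tree and does not attempt to parse a page.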
