Paper Information - Hybrid Page Layout Analysis via Tab-Stop Detection

Hybrid Page Layout Analysis via Tab-Stop Detection

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
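The bottom-up and top-down steps described in the abstract can be illustrated with a toy sketch. This is not the Tesseract implementation and it simplifies the method heavily: it assumes text-line bounding boxes are already available, treats a tab-stop as any x position shared by several left edges anywhere on the page, and ignores right edges, skew, and images. The names `Box`, `find_left_tab_stops`, `reading_order`, and the parameters `tolerance` and `min_support` are hypothetical, chosen only for this example.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of a text line, in pixels."""
    left: int
    top: int
    right: int
    bottom: int


def find_left_tab_stops(boxes: List[Box], tolerance: int = 3,
                        min_support: int = 3) -> List[int]:
    """Bottom-up step: cluster left edges that fall within `tolerance`
    pixels of the first edge in the cluster, and keep clusters backed by
    at least `min_support` boxes. Returns sorted x positions."""
    stops: List[int] = []
    cluster: List[int] = []
    for x in sorted(b.left for b in boxes):
        if cluster and x - cluster[0] > tolerance:
            if len(cluster) >= min_support:
                stops.append(round(sum(cluster) / len(cluster)))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(round(sum(cluster) / len(cluster)))
    return stops


def reading_order(boxes: List[Box], stops: List[int],
                  tolerance: int = 3) -> List[Box]:
    """Top-down step: treat each tab-stop as the start of a column, assign
    every box to the rightmost column starting at or before its left edge,
    then read column by column, top to bottom."""
    def column_of(b: Box) -> int:
        return max(bisect_right(stops, b.left + tolerance) - 1, 0)

    return sorted(boxes, key=lambda b: (column_of(b), b.top))


if __name__ == "__main__":
    page = [
        Box(300, 10, 520, 25), Box(50, 10, 270, 25),
        Box(49, 30, 270, 45), Box(301, 30, 520, 45),
        Box(51, 50, 270, 65), Box(300, 50, 520, 65),
        Box(70, 70, 270, 85),  # indented line, too few edges to be a tab-stop
    ]
    stops = find_left_tab_stops(page)
    for box in reading_order(page, stops):
        print(box)
```

In this sketch the indented line does not create a column of its own because only one edge supports it; requiring a minimum number of aligned edges is one simple way to separate layout-level alignment from incidental indentation.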

Raymond W. Smith
