The OCRopus open source OCR system

OCRopus is a new, open-source OCR system that emphasizes modularity, easy extensibility, and reuse. It is aimed at both the research community and large-scale commercial document conversion. This paper describes the current status of the system, its general architecture, and the major algorithms currently used for layout analysis and text line recognition.
