The SCRIBO Module of the Olena Platform: A Free Software Framework for Document Image Analysis

Electronic documents are becoming increasingly usable thanks to better and more affordable network, storage, and computational facilities. To benefit from computer-aided document management, however, paper documents must first be digitized and analyzed. This task is challenging at several levels. Data may be of multiple types, each requiring its own adapted processing chain. The tools to be developed should also take into account the needs and knowledge of users, and so range from a simple graphical application to a complete programming framework. Finally, the data sets to process may be large. In this paper, we present a set of features that a Document Image Analysis framework should provide to address these issues. In particular, the Generic Programming (GP) paradigm is a good strategy for addressing both flexibility and efficiency. These ideas are implemented as an open source module, SCRIBO, built on top of Olena, a generic and efficient image processing platform. Our solution features services such as preprocessing filters, text detection, page segmentation, and document reconstruction (as XML, PDF, or HTML documents). This framework, composed of reusable software components, can be used to create full-fledged graphical applications, small utilities, or processing chains to be integrated into third-party projects.
