An Open Architecture for End-to-End Document Analysis Benchmarking

In this paper, we present a fully operational, scalable, and open architecture that allows end-to-end document analysis benchmarking without the need to develop the whole pipeline. By decomposing the analysis process into coarse-grained tasks and building on community-provided state-of-the-art algorithms, our architecture allows any combination of elementary document analysis algorithms, regardless of their runtime environment, programming language, or data structures. Its flexible structure makes it straightforward to plug in new algorithms, compare them with other algorithms, and observe their effects on end-to-end tasks without having to install, compile, or otherwise interact with any software other than one's own.
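The idea of swapping one elementary algorithm and observing the end-to-end effect can be illustrated with a minimal sketch. It is not the architecture described in this paper: every name below (`run_pipeline`, `compare`, `strip_noise_a`, `strip_noise_b`, `normalize`, `end_to_end_accuracy`) is hypothetical, the stages are toy string functions standing in for real document analysis algorithms, and the score is a simple word-match ratio chosen only for illustration.

```python
from typing import Callable, Dict, List

# Assumption: a stage is any callable that maps one intermediate result to
# the next. Strings stand in for real intermediate data such as images.
Stage = Callable[[str], str]


def run_pipeline(stages: List[Stage], document: str) -> str:
    """Apply each coarse-grained stage in order and return the final output."""
    result = document
    for stage in stages:
        result = stage(result)
    return result


# Two hypothetical, interchangeable candidates for the same task.
def strip_noise_a(text: str) -> str:
    return text.replace("#", "")


def strip_noise_b(text: str) -> str:
    return text.replace("#", " ")


# A hypothetical downstream stage shared by every pipeline variant.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def end_to_end_accuracy(output: str, truth: str) -> float:
    """Fraction of ground-truth words matched position by position."""
    out_words, truth_words = output.split(), truth.split()
    matches = sum(1 for o, t in zip(out_words, truth_words) if o == t)
    return matches / max(len(truth_words), 1)


def compare(candidates: Dict[str, Stage], downstream: List[Stage],
            document: str, truth: str) -> Dict[str, float]:
    """Score each candidate by the end-to-end result of the full pipeline."""
    return {
        name: end_to_end_accuracy(
            run_pipeline([stage] + downstream, document), truth)
        for name, stage in candidates.items()
    }


scores = compare(
    {"strip_noise_a": strip_noise_a, "strip_noise_b": strip_noise_b},
    [normalize],
    "Hello#World  OPEN#architecture",
    "hello world open architecture",
)
```

In this sketch the candidates are judged only by the score of the final output, so a contributor would need to supply just their own stage; the remaining stages and the evaluation stay fixed.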
