Reproducible Research in Document Analysis and Recognition

With reproducible research becoming a de facto standard in computational sciences, many approaches have been explored to enable researchers in other disciplines to adopt this standard. In this paper, we explore the importance of reproducible research in the field of document analysis and recognition and in computer science as a whole. First, we report on the difficulties that one can face when trying to reproduce research under current publication standards. For a large percentage of research, these difficulties may include missing raw or original data, the lack of a tidied-up version of the data, unavailable source code, or the lack of the software needed to run the experiment. Furthermore, even when all of these are available, we found that replicating the research is not a trivial task because of missing documentation and deprecated dependencies. We offer a solution to these reproducibility issues by utilizing container technologies such as Docker. As an example, we revisit the installation and execution of OCRSpell, which we reported on and implemented in 1994. While the code for OCRSpell is freely available on GitHub, we continuously get emails from individuals who have difficulty compiling and using it on modern hardware platforms. We walk through the development of an OCRSpell Docker container: creating an image, uploading that image, and enabling others to run the program by simply downloading the image and running the container.
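The container workflow described above (create an image, upload it, then download and run it) can be sketched as the sequence of Docker CLI commands each party would issue. This is a minimal illustration under stated assumptions, not the authors' actual setup: the image name `example-user/ocrspell:latest` and the helper `container_workflow` are hypothetical, and the sketch only assembles the commands rather than executing them.

```python
import shlex

# Hypothetical image name; the real repository and tag are not given in the text.
IMAGE = "example-user/ocrspell:latest"


def container_workflow(image=IMAGE, context_dir="."):
    """Return the Docker CLI commands for each role as argument lists.

    The author builds an image from a Dockerfile in context_dir and
    uploads it to a registry; a user downloads it and starts a container.
    """
    return {
        "author": [
            ["docker", "build", "-t", image, context_dir],  # create the image
            ["docker", "push", image],                      # upload the image
        ],
        "user": [
            ["docker", "pull", image],                      # download the image
            ["docker", "run", "--rm", "-it", image],        # run interactively, remove on exit
        ],
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for role, commands in container_workflow().items():
        print(f"# {role}")
        for argv in commands:
            print(shlex.join(argv))
```

Keeping the commands as argument lists means they could be passed to `subprocess.run` unchanged on a machine where Docker is installed and a registry login is configured; both of those are assumptions outside this sketch.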
