Ground-Truth Production in the tranScriptorium Project

tranScriptorium is a three-year project that aims to develop innovative, cost-effective solutions for indexing, searching, and fully transcribing historical handwritten document images using Handwritten Text Recognition (HTR) technology. Producing ground truth (GT) for a dataset of handwritten document images is among its first tasks. We present novel approaches, based on crowdsourcing and on prior-knowledge methods, for producing this GT faster. We also present a novel low-cost, semi-supervised procedure for obtaining correctly aligned line-level pairs of detected/extracted text line images and their transcripts, which is especially suitable for training the models of the HTR technology employed in tranScriptorium.
