\(\textit{TexT}\) - Text Extractor Tool for Handwritten Document Transcription and Annotation

This paper presents a framework for semi-automatic transcription of large-scale historical handwritten documents and proposes a simple, user-friendly text extractor tool, \(\textit{TexT}\), for transcription. The proposed approach provides quick and easy transcription of text using a computer-assisted interactive technique. When the user marks a word, the algorithm finds other occurrences of that word on the fly using a word spotting system. \(\textit{TexT}\) can also annotate handwritten text on the fly, automatically generating ground truth labels and dynamically adjusting and correcting user-generated bounding box annotations so that each box tightly encloses its word. The user can view the document and the found words either in their original form or with background noise removed, which makes the transcription results easier to inspect. The effectiveness of \(\textit{TexT}\) is demonstrated on an archival manuscript collection from a well-known, publicly available dataset.
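To make the idea of bounding box adjustment concrete, the following is a minimal sketch of one simple way a loosely drawn box could be shrunk to fit a word. It is an illustration only and not necessarily the method used by \(\textit{TexT}\). It assumes the page has already been binarized into a grid of 0/1 values (1 = ink); the function name `tighten_box` and the box layout `(top, left, bottom, right)` are hypothetical choices made for this example.

```python
def tighten_box(ink, box):
    """Shrink a user-drawn box to the smallest box enclosing its ink pixels.

    ink: list of rows, each a list of 0/1 values (1 = ink, 0 = background).
    box: (top, left, bottom, right), with bottom and right exclusive.
    Returns the box unchanged if it contains no ink.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    top, left, bottom, right = box
    # Rows and columns inside the box that contain at least one ink pixel.
    rows = [r for r in range(top, bottom) if any(ink[r][left:right])]
    cols = [c for c in range(left, right)
            if any(ink[r][c] for r in range(top, bottom))]
    if not rows or not cols:
        return box
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


# Toy 5x6 "page" with a small blob of ink.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
loose = (0, 0, 5, 6)            # user-drawn box covering the whole page
tight = tighten_box(page, loose)
print(tight)                    # (1, 2, 3, 4)
```

This sketch only shrinks a box; a box that cuts through a word would also need to grow, for example by following connected ink components past the box edge, which is omitted here.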
