Making Large Collections of Handwritten Material Easily Accessible and Searchable

Libraries and cultural organisations hold a wealth of digitised historical handwritten material in the form of scanned images. The vast majority of this material has not yet been transcribed, owing to technological challenges and a lack of expertise. This makes it difficult to open these historical collections to the public, and in particular to support even a simple text search across a collection. Machine learning based methods for handwritten text recognition are gaining importance, but they require large amounts of pre-transcribed text to train the system. In practice, it is unrealistic to obtain several thousand pre-transcribed documents, given the difficulties transcribers face. This paper therefore presents a training-free word spotting algorithm as an alternative to handwritten text transcription, with case studies on Alvin (a Swedish repository) and Clavius on the Web. The main focus of the work is to discuss the prospects of making the materials in the Alvin platform and in Clavius on the Web easily searchable using a word spotting based handwritten text recognition system.
