The Carabela Project and Manuscript Collection: Large-Scale Probabilistic Indexing and Content-based Classification

The main aim of the Carabela project was to develop and apply techniques that allow textual search on massive Spanish collections of 15th-19th century manuscripts. The project focused on a relatively small subset of 125 000 images from collections of interest to underwater archaeology. For manuscripts of this type, state-of-the-art automatic transcription techniques generally fail to achieve usable accuracy. Therefore, rather than insisting on actual transcription, methodologies for probabilistic indexing of handwritten text images have been adopted. This has allowed us to cope effectively with the intrinsically high degree of uncertainty of the text contained in most historical manuscripts, leading to highly effective systems for textual search and retrieval. Carabela has gone one step further by developing new techniques to classify probabilistically indexed, but otherwise untranscribed, text images according to their textual content. These techniques have been successfully used to automatically classify Carabela bundles (each containing hundreds or thousands of pages) according to their “level of risk” of public exposure, in order to control access to them and to prevent, as far as possible, the plundering of Spanish underwater heritage.
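
To make the idea of probabilistic indexing more concrete, the following is a minimal Python sketch and not the project's actual implementation. It assumes a toy index in which each entry stores a page identifier, a hypothesized word, and the probability that the word is actually written there. The entries in `INDEX` and the function names `search` and `expected_counts` are hypothetical. Search keeps only entries above a confidence threshold, and summing the probabilities per page gives expected word counts that could feed a content-based classifier.

```python
from collections import defaultdict

# Hypothetical toy index: (page_id, word, relevance_probability).
# A real index would hold many alternative word hypotheses per image region.
INDEX = [
    ("p1", "navio", 0.92),
    ("p1", "oro", 0.35),
    ("p2", "navio", 0.10),
    ("p2", "plata", 0.80),
    ("p2", "plata", 0.50),
]

def search(index, word, threshold):
    """Return pages where `word` is spotted with probability >= threshold,
    ordered by the highest probability found on each page."""
    best = {}
    for page, w, p in index:
        if w == word and p >= threshold:
            best[page] = max(best.get(page, 0.0), p)
    return sorted(best, key=best.get, reverse=True)

def expected_counts(index):
    """Estimate per-page word counts as the sum of spotting probabilities.
    These soft counts can stand in for term frequencies in a classifier."""
    counts = defaultdict(lambda: defaultdict(float))
    for page, w, p in index:
        counts[page][w] += p
    return counts
```

Lowering the threshold trades precision for recall: `search(INDEX, "navio", 0.5)` returns only the confident page, whereas a threshold of 0.05 also returns the page with the doubtful spot.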
