Automatic Algorithms for Medieval Manuscript Analysis

Large-scale digital acquisition and preservation of deteriorating historical and artistic documents are particularly important because of the documents' value and fragile condition. Studying and browsing the resulting digital libraries is invaluable for scholars in the Cultural Heritage field, but doing so requires automatic tools for analyzing and indexing these datasets. We describe a set of fully automatic solutions to estimate per-page text leading, to extract text lines, blocks, and other layout elements, and to perform query-by-example word spotting on medieval manuscripts. These techniques have been evaluated on a large, heterogeneous corpus of illuminated medieval manuscripts that differ in writing style, language, image resolution, amount of illumination and ornamentation, and level of conservation, and that exhibit problems such as holes, spots, ink bleed-through, ornamentation, and background noise. We also present a quantitative analysis to better assess the quality of the proposed algorithms. Because the developed methods require no human intervention to produce large amounts of annotated training data, they provide Computer Vision researchers and Cultural Heritage practitioners with a compact and efficient system for document analysis.
