Searching historical manuscripts for near-duplicate figures

In the next decade, a majority of all books ever published will be digitized and available online. Most of the content of historical manuscripts is text, but a large amount of space is also devoted to images. This observation has driven a dramatic increase in interest in query-by-content systems for historical documents. While querying and indexing systems can be useful, we believe that this domain is finally ready for unsupervised discovery of patterns. With this in mind, we introduce an efficient and scalable technique that detects approximately repeated occurrences of images both within and between historical texts. We demonstrate that this ability to find repeated shapes enables automatic annotation of manuscripts, and we show the utility of our technique on datasets dating back to the fourteenth century.
