New public dataset for spotting patterns in medieval document images

Abstract. With advances in technology, a large part of our cultural heritage is becoming digitally available. In historical document image analysis in particular, there is a growing need for indexing and data mining tools that can spot and retrieve the occurrences of an object of interest, called a pattern, in a large database of document images. Patterns can vary in color, shape, or context, which makes spotting them a challenging task. Pattern spotting is a relatively new field of research, still hampered by the lack of annotated resources. We present a new publicly available dataset named DocExplore, dedicated to spotting patterns in historical document images. The dataset contains 1500 images and 1464 queries and supports the evaluation of two tasks: image retrieval and pattern localization. A standardized benchmark protocol with ad hoc metrics is provided for a fair comparison of submitted approaches. We also report first results obtained with our baseline system on this dataset; they show that there is room for improvement and should encourage researchers in the document image analysis community to design new systems and submit improved results.
