Layout Analysis for Historical Manuscripts Using SIFT Features

We propose a layout analysis method for historical manuscripts that relies on the part-based identification of layout entities. A layout entity -- such as the letters of the body text, an initial or a heading -- is composed of a set of characteristic segments or structures, and this set differs between the entity classes found in the manuscripts under consideration. We exploit this fact to segment a manuscript page into homogeneous regions. Historical documents typically pose challenges such as uneven writing support, varying character shapes, fluctuating text lines, changing scripts and writing styles, and variance in the layout itself. We therefore propose a part-based detection of layout entities that uses a multi-stage algorithm, based on interest points, to localize the entities. Results show that the proposed method locates initials, headings and text areas sufficiently well in ancient manuscripts containing stains, tears and partially faded-out ink.
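The abstract does not spell out the individual stages, so the following is only a minimal sketch of the general idea of localizing entities from classified interest points. It assumes that each interest point has already been assigned a class label by some classifier applied to its local descriptor. The function name `localize_entities`, the coarse voting grid, and the parameters `cell_size` and `min_votes` are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict


def localize_entities(points, cell_size=50, min_votes=3):
    """Group labelled interest points into coarse page regions.

    points    -- iterable of (x, y, label); labels are assumed to come from
                 a descriptor classifier (hypothetical, not shown here)
    cell_size -- side length of a voting cell in pixels (assumption)
    min_votes -- minimum votes of the winning class for a cell to be kept
    """
    votes = defaultdict(Counter)
    for x, y, label in points:
        cell = (int(x // cell_size), int(y // cell_size))
        votes[cell][label] += 1

    regions = {}
    for cell, counter in votes.items():
        label, count = counter.most_common(1)[0]
        # Sparsely supported cells (e.g. points on stains) are discarded.
        if count >= min_votes:
            regions[cell] = label
    return regions


# Toy input: three "initial" points, a text cluster with one stray
# "heading" vote, and one isolated point.
points = [
    (10, 12, "initial"), (20, 30, "initial"), (35, 40, "initial"),
    (120, 15, "text"), (130, 25, "text"), (140, 45, "text"),
    (110, 5, "heading"),
    (300, 300, "text"),
]
regions = localize_entities(points)
print(regions)
```

In this toy example the isolated point is dropped and the stray "heading" vote is outvoted by its neighbours, which illustrates why aggregating many local parts can tolerate individual misclassifications on degraded pages.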
