Active clustering of document fragments using information derived from both images and catalogs

Many significant historical corpora contain leaves that are mixed up and no longer bound in their original state as multi-page documents. The reconstruction of old manuscripts from a mix of disjoint leaves can therefore be of paramount importance to historians and literary scholars. Previously, it was shown that visual similarity provides meaningful pairwise similarities between handwritten leaves. Here, we go a step further and suggest a semiautomatic clustering tool that helps reconstruct the original documents. The proposed solution is a graphical model that draws inferences from the catalog information provided for each leaf as well as from the pairwise similarities of handwriting. Several novel active clustering techniques are explored, and the solution is applied to a significant part of the Cairo Genizah, where the problem of joining leaves remains unsolved even after a century of extensive study by hundreds of scholars.
