Learning Structure and Schemas from Documents

The rapidly growing volume of digital documents in various formats, together with the ability to access them through Internet-based technologies, has created a need for solid methods to organize and structure documents in large digital libraries and repositories. Because these collections are extremely large and mostly unstructured, most research efforts in this direction are dedicated to automatically inferring structure and schemas that can help to better organize huge collections of documents and data. This book covers the latest advances in structure inference in heterogeneous collections of documents and data. It offers a comprehensive view of the state of the art in the area, presents lessons learned, and identifies new research issues, challenges and opportunities for further research and development. The selected chapters cover a broad range of research issues, from theoretical approaches to case studies and best practices in the field. Researchers, software developers, practitioners and students interested in learning structure and schemas from documents will find the comprehensive coverage of this book useful for their research, academic, development and practice activities.
