Comprehensive and Scalable Appraisals of Contemporary Documents

This book chapter describes problems in the analysis of contemporary documents. Contemporary documents contain multiple digital objects of different types. These objects have to be extracted from document containers, represented as data structures, and described by features suitable for comparing them. In many archival and machine learning applications, documents are compared using multiple metrics, checked for integrity and authenticity, and grouped by similarity. The objective of this chapter is to describe methodologies for contemporary document processing, visual exploration, grouping, and integrity verification, and to discuss computational scalability challenges and solutions.
