Intelligent Parsing of Scanned Volumes for Web Based Archives

The proliferation of digital libraries and the large volume of existing documents raise important issues for efficient document handling. Printed text needs to be converted into digital format, and semantic information needs to be parsed and managed for effective retrieval. In this work, we attempt to solve problems faced by current web-based archives, where large-scale repositories of electronic resources have been built from scanned volumes. Specifically, we focus on the scientific domain and target scanned volumes of scientific publications. Our goal is to automate the semantic processing of scanned volumes, an important and challenging step towards efficient retrieval of their content. We tackle the problem by designing a machine learning-based method that extracts multi-level metadata about the content of scanned volumes, combining image and text information for intelligent parsing. We developed a system and tested it with real-world data from the Internet Archive; the experimental evaluation demonstrated good results.
