Automated Building of OAI Compliant Repository from Legacy Collection

In this paper, we report on our experience creating an automated, human-assisted process to extract metadata from documents in a large (>100,000), dynamically growing collection. Such a collection can be expected to be heterogeneous in two ways: statically (it contains documents in a variety of formats) and dynamically (it is likely to acquire new documents in formats unlike any prior acquisitions). Eventually, we hope to fully automate metadata extraction for 80% of the documents and to reduce the time needed to generate metadata for the remaining documents by 80% as well. We describe our process of first classifying documents into equivalence classes, for which we can then use a rule-based approach to extract metadata. Our rule-based approach differs from others in that it separates the rule-interpreting engine from a template of rules: the templates vary among classes, but the engine is the same. We have evaluated our approach on a test bed of 7413 randomly selected documents from the DTIC (Defense Technical Information Center) collection, with encouraging results. Finally, we describe how this process can be used to generate an OAI (Open Archives Initiative)-compliant digital library from a stream of incoming documents.
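
The separation of a single rule-interpreting engine from per-class rule templates can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names ("report_cover", "memo"), the `Rule` structure, the regular expressions, and the `extract` function are hypothetical and do not reproduce the actual template language or engine evaluated on the DTIC collection.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    field: str    # metadata field to fill, e.g. "title"
    pattern: str  # regular expression with exactly one capture group


# Hypothetical templates: one list of rules per document class.
# Supporting a new class means adding a template, not changing the engine.
TEMPLATES = {
    "report_cover": [
        Rule("title", r"^Title:\s*(.+)$"),
        Rule("author", r"^Author:\s*(.+)$"),
    ],
    "memo": [
        Rule("title", r"^SUBJECT:\s*(.+)$"),
        Rule("author", r"^FROM:\s*(.+)$"),
    ],
}


def extract(text: str, doc_class: str) -> dict:
    """Generic engine: apply the template of the given class to the text."""
    metadata = {}
    for rule in TEMPLATES[doc_class]:
        match = re.search(rule.pattern, text, flags=re.MULTILINE)
        if match:
            metadata[rule.field] = match.group(1).strip()
    return metadata


if __name__ == "__main__":
    memo = "FROM: J. Smith\nSUBJECT: Test Plan\n"
    print(extract(memo, "memo"))
```

In this sketch, the classification step is assumed to have already assigned `doc_class`; the same `extract` function then serves every class, which mirrors the design choice of keeping the engine fixed while the templates vary.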
