Genre Classification in Automated Ingest and Appraisal Metadata

Metadata creation is a crucial part of ingesting digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. We propose a classification method that examines a document from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with a characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. We describe the results of some experiments on the first two directions; they are meant to indicate the promise of this multifaceted approach.
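
To make the idea of classifying from several directions concrete, the following is a minimal sketch in which each direction is a separate function that proposes a genre label and the proposals are combined by a simple majority vote. Everything in it is an assumption for illustration: the function names, the document fields (`columns`, `has_abstract_block`, `has_letterhead`, `text`), the two genre labels, and the voting rule are hypothetical and are not taken from the paper. Only the first two directions (visual format and string layout) are sketched.

```python
from collections import Counter

# Hypothetical per-direction classifiers. Each inspects one facet of a
# document (represented here as a plain dict) and returns a genre label,
# or None when it cannot decide.

def visual_view(doc):
    # Assumption: a two-column page with an abstract block suggests an article.
    if doc.get("columns") == 2 and doc.get("has_abstract_block"):
        return "scientific_article"
    if doc.get("columns") == 1 and doc.get("has_letterhead"):
        return "business_letter"
    return None

def string_layout_view(doc):
    # Assumption: the order of characteristic strings hints at the genre.
    text = doc.get("text", "").lower()
    if "abstract" in text and "references" in text:
        if text.index("abstract") < text.index("references"):
            return "scientific_article"
    if text.startswith("dear "):
        return "business_letter"
    return None

def classify(doc, views):
    # Collect one vote per view that reached a decision.
    votes = Counter()
    for view in views:
        label = view(doc)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None  # no view could decide
    # Return the label with the most votes.
    return votes.most_common(1)[0][0]

VIEWS = [visual_view, string_layout_view]
```

A call such as `classify({"columns": 2, "has_abstract_block": True, "text": "Abstract ... References ..."}, VIEWS)` collects one vote from each view. Keeping the views independent means that further directions, such as stylometric features, could be added as extra functions without changing `classify`.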
