AUTOMATED GENRE CLASSIFICATION IN THE MANAGEMENT OF DIGITAL DOCUMENTS

This paper examines automated genre classification of text documents and its role in enabling digital libraries and other repositories to manage digital documents effectively. Genre classification, which narrows down the possible structure of a document, is a valuable step towards the general automatic extraction of semantic metadata, which is essential to the efficient management and use of digital objects. Characterising a digital object in terms of genre also associates the object with the objectives that led to its creation, which in turn indicates its relevance to new objectives in information search. In this report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between word frequency metrics and the degree of their significance in carrying out classification in varying environments.
