Examining Variations of Prominent Features in Genre Classification

This paper investigates the correlation between three types of features (visual, stylistic, and topical) and genre classes. Most previous studies in automated genre classification have built models on an amalgamated representation of a document that combines features of several kinds. In such models the roles of the different features cannot be separated, which makes it difficult to determine how to improve the classifier when it performs poorly on particular genres. In this paper we use classifiers modeled independently on the three groups of features to examine six genre classes, and show that the strongest features for making one classification are not necessarily the best features for carrying out another.
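As a rough illustration of the setup described above, the following sketch trains one classifier per feature group and compares recall genre by genre. It is a minimal toy under stated assumptions, not the paper's method: the nearest-centroid classifier is a stand-in chosen for brevity, the three genre names (`paper`, `email`, `slides`) and all feature values are invented, and the names `fit_centroids`, `predict`, and `per_genre_recall` are hypothetical helpers.

```python
from collections import defaultdict

GROUPS = ("visual", "stylistic", "topical")

# (genre, {feature group: feature vector}); every value here is invented.
TRAIN = [
    ("paper",  {"visual": [0.2], "stylistic": [0.5], "topical": [0.9]}),
    ("paper",  {"visual": [0.4], "stylistic": [0.3], "topical": [0.8]}),
    ("email",  {"visual": [0.3], "stylistic": [0.9], "topical": [0.4]}),
    ("email",  {"visual": [0.5], "stylistic": [0.8], "topical": [0.2]}),
    ("slides", {"visual": [0.8], "stylistic": [0.4], "topical": [0.3]}),
    ("slides", {"visual": [0.9], "stylistic": [0.6], "topical": [0.5]}),
]
TEST = [
    ("paper",  {"visual": [0.25], "stylistic": [0.52], "topical": [0.90]}),
    ("paper",  {"visual": [0.42], "stylistic": [0.35], "topical": [0.80]}),
    ("email",  {"visual": [0.28], "stylistic": [0.90], "topical": [0.25]}),
    ("email",  {"visual": [0.45], "stylistic": [0.80], "topical": [0.42]}),
    ("slides", {"visual": [0.90], "stylistic": [0.38], "topical": [0.45]}),
    ("slides", {"visual": [0.80], "stylistic": [0.55], "topical": [0.28]}),
]


def fit_centroids(docs, group):
    """Average the vectors of one feature group per genre (stand-in classifier)."""
    sums, counts = {}, defaultdict(int)
    for genre, feats in docs:
        vec = feats[group]
        if genre not in sums:
            sums[genre] = [0.0] * len(vec)
        sums[genre] = [s + v for s, v in zip(sums[genre], vec)]
        counts[genre] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def predict(centroids, vec):
    """Return the genre whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(centroids, key=lambda g: sq_dist(centroids[g]))


def per_genre_recall(train, test, group):
    """Train on one feature group only and report recall for each genre."""
    centroids = fit_centroids(train, group)
    hits, totals = defaultdict(int), defaultdict(int)
    for genre, feats in test:
        totals[genre] += 1
        if predict(centroids, feats[group]) == genre:
            hits[genre] += 1
    return {g: hits[g] / totals[g] for g in totals}


# One independent model per feature group, evaluated genre by genre.
recall = {group: per_genre_recall(TRAIN, TEST, group) for group in GROUPS}

# For each genre, pick the feature group whose model recalls it best.
best_group = {
    genre: max(GROUPS, key=lambda grp: recall[grp][genre])
    for genre in recall["visual"]
}

for genre, group in best_group.items():
    print(genre, group, recall[group][genre])
```

On this invented data each genre is best recognized by a different feature group, which is the kind of per-genre variation the abstract describes. Keeping the models separate is what makes the comparison possible: with a single amalgamated feature vector, the contribution of each group to a given genre's errors could not be read off directly.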
