The Form is the Substance: Classification of Genres in Text

Text categorization in information retrieval (IR) has traditionally focused on topic. As use of the Internet and e-mail grows, categorization has become a key research area because users demand ways to prioritize documents. This work investigates classifying text by format style, i.e. "genre", and demonstrates that genre classification, used to complement topic classification, can significantly improve information retrieval. The paper compares presentation features with word features, and with the combination of the two, using Naive Bayes, C4.5 and SVM classifiers. Results show that the combined feature set with an SVM yields 92% classification accuracy when sorting documents into seven genres.
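The idea of combining presentation features with word features can be illustrated with a minimal sketch. The specific presentation cues below (line counts, share of uppercase, digit and punctuation characters) and all function names are assumptions chosen for illustration; the abstract does not list the features actually used, and no classifier is trained here.

```python
import re
from collections import Counter


def presentation_features(text):
    """Layout and style cues (hypothetical examples, not the paper's feature set)."""
    lines = text.splitlines() or [""]
    chars = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "pf:num_lines": float(len(lines)),
        "pf:avg_line_len": sum(len(line) for line in lines) / len(lines),
        "pf:blank_lines": float(sum(1 for line in lines if not line.strip())),
        "pf:frac_upper": sum(c.isupper() for c in text) / chars,
        "pf:frac_digit": sum(c.isdigit() for c in text) / chars,
        "pf:frac_punct": sum(c in "!?.,;:" for c in text) / chars,
    }


def word_features(text):
    """Relative frequency of each lowercased word token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return {f"w:{tok}": count / total for tok, count in Counter(tokens).items()}


def combined_features(text):
    """Merge both feature families into one sparse feature dictionary."""
    feats = presentation_features(text)
    feats.update(word_features(text))  # "pf:" and "w:" prefixes keep keys distinct
    return feats
```

A dictionary of this shape could then be vectorized and passed to any of the three classifier families the paper compares; for instance, scikit-learn's `DictVectorizer` followed by `LinearSVC` would be one possible way to try the SVM variant.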
