An evaluation of text classification methods for literary study

Text classification methods have been evaluated on topic classification tasks. This thesis extends the empirical evaluation to emotion classification tasks in the literary domain. It uses two literary text classification problems as cases: eroticism classification in Dickinson's poems and sentimentalism classification in early American novels. Both problems focus on identifying a certain kind of emotion, a document property other than topic. The study evaluates two popular text classification algorithms, naive Bayes and Support Vector Machines (SVM), and three feature engineering options: stemming, stop word removal, and statistical feature selection (Odds Ratio and SVM). It examines the effects of the chosen classifiers and feature engineering options on the two emotion classification problems, as well as the interaction between the classifiers and the feature engineering options. This thesis seeks empirical answers to the following research questions: (1) Is SVM a better classifier than naive Bayes with regard to classification accuracy, new literary knowledge discovery, and potential for example-based retrieval? (2) Is SVM a better feature selection method than Odds Ratio with regard to feature reduction rate and classification accuracy improvement? (3) Does stop word removal affect classification performance? (4) Does stemming affect the performance of classifiers and feature selection methods? Some of our conclusions are consistent with those obtained in topic classification: for example, Odds Ratio does not improve SVM performance, and stop word removal might harm classification. Some conclusions contradict previous results: for example, SVM does not beat naive Bayes in both cases. Some findings are new to this area: SVM and naive Bayes select top features in different frequency ranges, and stemming might harm feature selection methods.
These experimental results provide new insights into the relation between classification methods, feature engineering options, and non-topic document properties. They also provide guidance for selecting classification methods in literary text classification applications.
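As a rough illustration of what statistical feature selection by Odds Ratio involves, the sketch below scores each term by the log odds ratio of its document frequency in the positive class versus the negative class and keeps the top-ranked terms. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function names, the additive smoothing constant, the use of document frequency, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, smoothing=0.5):
    """Return {term: log odds ratio} for the positive class (label 1).

    docs   -- list of token lists (already tokenized; an assumption)
    labels -- list of 0/1 class labels, one per document
    smoothing -- hypothetical additive constant that keeps the
                 probabilities strictly between 0 and 1
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos

    # Document frequency of each term within each class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = df_pos if y == 1 else df_neg
        target.update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        p_pos = (df_pos[term] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (df_neg[term] + smoothing) / (n_neg + 2 * smoothing)
        # Odds of the term in the positive class over its odds in the negative class.
        scores[term] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def top_features(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_docs = [["wild", "nights"], ["wild", "heart"],
                ["the", "garden"], ["the", "bee"]]
    toy_labels = [1, 1, 0, 0]
    print(top_features(odds_ratio_scores(toy_docs, toy_labels), 3))
```

Terms concentrated in the positive class receive large positive scores, and terms concentrated in the negative class receive negative ones, so ranking by this score favors features indicative of the target category.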
