Using feature construction to avoid large feature spaces in text classification

Feature space design is a critical part of machine learning. It is especially difficult in text classification, where an arbitrary number of features of varying complexity can be extracted from documents as a preprocessing step. Researchers have consistently had to balance the expressiveness of features against the size of the corresponding feature space, because data sparsity worsens as feature spaces grow. Drawing on past successes of genetic programming on similar problems outside text classification, we propose and implement a technique for constructing complex features from simpler ones and adding these complex features to a combined feature space, which can then be used by more sophisticated machine learning classifiers. Applying this technique to a sentiment analysis problem, we show an encouraging improvement in classification accuracy with a small, constant increase in feature space size. We also show that the features we generate carry far more predictive power than any of the simple features they contain.
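
To make the idea of constructing complex features from simpler ones concrete, here is a minimal sketch in Python. It is not the method of the paper, whose feature representation, search procedure, and fitness function are not given in this abstract. It assumes unigram-presence features as the simple features, a single hypothetical template `a AND NOT b` as the complex feature form, exhaustive enumeration standing in for the genetic programming search, and a tiny invented corpus.

```python
from itertools import permutations

# Toy labelled corpus, invented purely for illustration (1 = positive review).
DOCS = [
    ("a great and moving film", 1),
    ("great cast but a boring plot", 0),
    ("boring and not great", 0),
    ("moving story with a great ending", 1),
    ("not moving and not great", 0),
    ("great fun", 1),
]

def tokenize(text):
    """Whitespace tokenizer returning the set of tokens in a document."""
    return set(text.split())

def simple(word):
    """Simple feature: fires when `word` occurs in the document."""
    return lambda tokens: word in tokens

def and_not(word_a, word_b):
    """Constructed feature (assumed template): `word_a` present, `word_b` absent."""
    return lambda tokens: word_a in tokens and word_b not in tokens

def accuracy(fires, docs):
    """Accuracy of the rule 'predict positive exactly when the feature fires'."""
    correct = sum(fires(tokenize(text)) == bool(label) for text, label in docs)
    return correct / len(docs)

# Sorted vocabulary so that the search below is deterministic.
vocab = sorted(set().union(*(tokenize(text) for text, _ in DOCS)))

# Score every simple feature on its own.
simple_scores = {w: accuracy(simple(w), DOCS) for w in vocab}
best_simple_score = max(simple_scores.values())

# Exhaustive search over constructed features; a genetic programming
# search would explore a much larger space of expression trees instead.
pair_scores = {
    (a, b): accuracy(and_not(a, b), DOCS) for a, b in permutations(vocab, 2)
}
best_pair = max(pair_scores, key=pair_scores.get)
best_pair_score = pair_scores[best_pair]

def vectorize(text):
    """Combined feature space: all unigram columns plus one constructed column."""
    tokens = tokenize(text)
    a, b = best_pair
    return [int(w in tokens) for w in vocab] + [int(and_not(a, b)(tokens))]

if __name__ == "__main__":
    print("best simple feature accuracy:", best_simple_score)
    print("best constructed feature:", best_pair, best_pair_score)
    print("vector length:", len(vectorize("great fun")), "vs vocab size", len(vocab))
```

In this sketch the combined feature space grows by exactly one column however large the vocabulary is, which mirrors the small, constant increase described above. Scoring on the same documents used for the search, as done here for brevity, would overstate accuracy in any real experiment.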
