Feature Generation for Text Categorization Using World Knowledge

We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies, such as the Open Directory, that contain hundreds of thousands of concepts; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, which implicitly performs word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.
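
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `ONTOLOGY` table, the word-overlap scoring in `best_concept`, the `CONCEPT:` feature prefix, and the `window` parameter are all assumptions made for illustration, standing in for a real ontology such as the Open Directory and for whatever contextual analysis the paper actually uses. The sketch only shows the general idea: map each context to its best-matching concept, add that concept and its ancestors as features, and keep the original bag of words.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept, characteristic words).
# A real system would derive these word sets from a large ontology.
ONTOLOGY = {
    "Top": (None, set()),
    "Business": ("Top", {"bank", "loan", "interest", "money"}),
    "Business/Banking": ("Business", {"bank", "deposit", "account", "loan"}),
    "Recreation": ("Top", {"river", "fishing", "boat"}),
    "Recreation/Fishing": ("Recreation", {"river", "bank", "fishing", "trout"}),
}


def ancestors(concept):
    """Yield the concept and each of its ancestors, excluding the root."""
    while concept is not None and concept != "Top":
        yield concept
        concept = ONTOLOGY[concept][0]


def best_concept(context):
    """Return the concept whose word set overlaps the context most, or None."""
    context_words = set(context)
    scores = {c: len(words & context_words) for c, (_, words) in ONTOLOGY.items()}
    concept, score = max(scores.items(), key=lambda kv: kv[1])
    return concept if score > 0 else None


def generate_features(tokens, window=3):
    """Augment the bag of words with concept features found per context window."""
    features = Counter(tokens)  # the standard bag of words
    n_contexts = max(1, len(tokens) - window + 1)
    for start in range(n_contexts):
        context = tokens[start:start + window]
        concept = best_concept(context)
        if concept is not None:
            # Generalize: the matched concept and its ancestors become features.
            for c in ancestors(concept):
                features["CONCEPT:" + c] += 1
    return features


if __name__ == "__main__":
    doc = "fishing for trout on the river bank".split()
    print(generate_features(doc, window=len(doc)))
```

In this toy setup the word "bank" alone is ambiguous, but the surrounding words decide which concept wins, which is the sense in which contextual matching disambiguates implicitly. Adding ancestor concepts lets two documents with no words in common still share a feature.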
