Text categorization by boosting automatically extracted concepts

Term-based representations of documents have found wide-spread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage.In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The utilized concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.

[1]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[2]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[3]  Chris Buckley,et al.  OHSUMED: an interactive retrieval evaluation and new large test collection for research , 1994, SIGIR '94.

[4]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..

[5]  Susan T. Dumais,et al.  Using LSI for information filtering: TREC-3 experiments , 1995 .

[6]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[7]  Yiming Yang,et al.  Noise reduction in a statistical approach to text categorization , 1995, SIGIR '95.

[8]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[9]  Yoram Singer,et al.  An Efficient Boosting Algorithm for Combining Preferences by , 2013 .

[10]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[11]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[12]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[13]  Thomas Hofmann,et al.  ProbMap - A probabilistic approach for mapping large document collections , 2000, Intell. Data Anal..

[14]  Ran El-Yaniv,et al.  On feature distributional clustering for text categorization , 2001, SIGIR '01.

[15]  Nello Cristianini,et al.  Learning Semantic Similarity , 2002, NIPS.

[16]  Nello Cristianini,et al.  Latent Semantic Kernels , 2001, Journal of Intelligent Information Systems.

[17]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[18]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.