On the Importance of Parameter Tuning in Text Categorization

Text Categorization algorithms have many parameters that determine their behaviour. The effect of these parameters is hard to predict, either objectively or intuitively, and may well depend on the corpus or on the document representation. Parameter values are usually taken over from previously published results, which may lead to suboptimal accuracy on a particular corpus. In this paper we investigate the effect of parameter tuning on the accuracy of two Text Categorization algorithms: the well-known Rocchio algorithm and the lesser-known Winnow. We show that the optimal parameter values for a specific corpus are sometimes very different from those found in the literature, that the effect of individual parameters is corpus-dependent, and that parameter tuning can greatly improve the accuracy of both Winnow and Rocchio. We argue that the dependence of categorization algorithms on experimentally established parameter values makes it hard to compare the outcomes of different experiments, and we propose automatic determination of optimal parameters on the training set as a solution.
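
To make the closing proposal concrete, the following is a minimal sketch of tuning parameters on the training set alone. It is not the procedure used in the paper. It assumes a basic Rocchio variant with weights `beta` and `gamma`, a decision threshold, a simple held-out split of the training data, and plain accuracy as the selection criterion. All function names, the toy data, and the candidate grids are hypothetical.

```python
from itertools import product

def centroid(vectors, dim):
    """Mean vector of a list of equal-length term-weight vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio_prototype(pos, neg, dim, beta, gamma):
    """Class prototype: beta * positive centroid - gamma * negative centroid.
    Negative weights are clipped to zero (an assumption of this sketch)."""
    p, n = centroid(pos, dim), centroid(neg, dim)
    return [max(0.0, beta * pi - gamma * ni) for pi, ni in zip(p, n)]

def score(weights, doc):
    """Dot product between the prototype and a document vector."""
    return sum(w * x for w, x in zip(weights, doc))

def accuracy(weights, threshold, docs, labels):
    """Fraction of documents whose thresholded score matches the label."""
    hits = sum((score(weights, d) > threshold) == y for d, y in zip(docs, labels))
    return hits / len(docs)

def tune_rocchio(docs, labels, betas, gammas, thresholds, holdout=0.25):
    """Grid search over (beta, gamma, threshold) using only training data:
    fit on the first part, select on the held-out remainder."""
    split = int(len(docs) * (1 - holdout))
    fit_x, fit_y = docs[:split], labels[:split]
    val_x, val_y = docs[split:], labels[split:]
    dim = len(docs[0])
    pos = [d for d, y in zip(fit_x, fit_y) if y]
    neg = [d for d, y in zip(fit_x, fit_y) if not y]
    best = None
    for beta, gamma, threshold in product(betas, gammas, thresholds):
        w = rocchio_prototype(pos, neg, dim, beta, gamma)
        acc = accuracy(w, threshold, val_x, val_y)
        if best is None or acc > best[0]:
            best = (acc, beta, gamma, threshold)
    return best  # (validation accuracy, beta, gamma, threshold)

# Toy two-term corpus, invented for illustration only.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9],
            [1.0, 0.2], [0.2, 1.0], [0.8, 0.0], [0.0, 0.8]]
toy_labels = [True, False, True, False, True, False, True, False]

best = tune_rocchio(toy_docs, toy_labels,
                    betas=[1.0], gammas=[0.0, 1.0], thresholds=[0.0, 0.5])
print(best)
```

Because the test set is never consulted during selection, results obtained this way remain comparable across experiments even when each corpus ends up with different parameter values.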
