Improving SVM Text Classification Performance through Threshold Adjustment

In general, support vector machines (SVMs), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. We describe an automatic process for adjusting the thresholds of generic SVMs that incorporates a user utility model, an integral part of an information management system. By using thresholds based on utility models and the ranking properties of classifiers, it is possible to overcome the precision bias of SVMs and ensure robust recall performance across a wide variety of topics, even when training data are sparse. Evaluations on TREC data show that our proposed threshold-adjusting algorithm boosts the performance of baseline SVMs by at least 20% on standard information retrieval measures.
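To make the idea of utility-based threshold adjustment concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes a linear utility of the form gain × (relevant documents accepted) − cost × (non-relevant documents accepted), with hypothetical weights `gain=2.0` and `cost=1.0`, and it assumes that SVM decision scores for a held-out set are already available. The function name `best_threshold` and the sample data are illustrative only. The sketch relies solely on the ranking induced by the scores: it walks down the ranked list and keeps the cut-off with the highest utility.

```python
from typing import Sequence, Tuple


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   gain: float = 2.0, cost: float = 1.0) -> Tuple[float, float]:
    """Return (threshold, utility) maximizing gain*TP - cost*FP.

    A document is accepted when its score >= threshold.
    The utility weights are assumptions, not values from the paper.
    """
    # Rank documents from highest to lowest decision score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    # Baseline: accept nothing, which yields zero utility.
    best_theta, best_utility = float("inf"), 0.0
    utility = 0.0
    for i, (score, label) in enumerate(ranked):
        utility += gain if label == 1 else -cost
        # Only consider cut-offs between distinct score values.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if utility > best_utility:
            best_theta, best_utility = score, utility
    return best_theta, best_utility


# Hypothetical held-out decision scores and relevance labels (1 = relevant).
scores = [1.2, 0.4, -0.1, -0.3, -0.9, -1.5]
labels = [1, 1, 0, 1, 0, 0]
theta, utility = best_threshold(scores, labels)
```

On this toy data the default SVM threshold of 0 accepts only two of the three relevant documents, whereas the selected threshold falls below 0 and admits the third relevant document at the price of one non-relevant one, trading a little precision for recall.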
