Imbalanced text classification: A term weighting approach

The natural distribution of textual data used in text classification is often imbalanced. Categories with fewer examples are under-represented, and their classifiers often perform far below a satisfactory level. We tackle this problem using a simple probability-based term weighting scheme to better distinguish documents in minor categories. This new scheme directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are supported by probability estimates that embody category membership. Our experimental study, which uses both Support Vector Machines and Naive Bayes classifiers and compares the scheme extensively with other classic weighting schemes over two benchmark data sets, including Reuters-21578, shows significant improvement for minor categories, while the performance for major categories is not jeopardized. Our approach offers a simple and effective way to boost the performance of text classification over skewed data sets.
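The abstract does not give the weighting formula, so the following is only an illustrative sketch of how a weight built from two relevance ratios might look. It assumes, for a term and a category, document counts a (in the category, containing the term), b (outside the category, containing the term) and c (in the category, lacking the term), takes the two ratios to be a/b and a/c, and combines them as tf * log(1 + (a/b)(a/c)). The function names, the `smoothing` parameter, and the combination itself are assumptions made for illustration, not the authors' stated method.

```python
import math


def relevance_ratios(docs, labels, category, term, smoothing=1.0):
    """Return the two assumed relevance ratios (a/b, a/c) for a term.

    docs   : list of token lists
    labels : list of category labels, aligned with docs
    The smoothing constant is added to each denominator only to avoid
    division by zero; it is an assumption of this sketch.
    """
    pairs = list(zip(docs, labels))
    a = sum(1 for d, y in pairs if y == category and term in d)
    b = sum(1 for d, y in pairs if y != category and term in d)
    c = sum(1 for d, y in pairs if y == category and term not in d)
    return a / (b + smoothing), a / (c + smoothing)


def term_weight(tf, ratios):
    """Combine term frequency with the two ratios (assumed form)."""
    r1, r2 = ratios
    return tf * math.log(1.0 + r1 * r2)


# Toy corpus: "crude" is the minor category (2 of 5 documents).
docs = [
    ["oil", "price"],
    ["oil", "export"],
    ["bank", "rate"],
    ["bank", "price"],
    ["bank", "loan"],
]
labels = ["crude", "crude", "money", "money", "money"]

w_oil = term_weight(1, relevance_ratios(docs, labels, "crude", "oil"))
w_price = term_weight(1, relevance_ratios(docs, labels, "crude", "price"))
print(w_oil > w_price)
```

In this toy corpus, "oil" occurs only in the minor category and so receives a larger weight than "price", which occurs in both categories. That is the behaviour a category-aware weighting scheme is meant to produce for under-represented classes.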
