A NOVEL TERM WEIGHTING SCHEME MIDF FOR TEXT CATEGORIZATION

Text categorization is the task of automatically assigning documents to a set of predefined categories. It typically involves a document representation method and a term weighting scheme. This paper proposes a new term weighting scheme, Modified Inverse Document Frequency (MIDF), to improve text categorization performance. Documents represented with MIDF weights are used to train a support vector machine classifier with a radial basis function kernel. The experiments are carried out on the Reuters-21578 corpus. Performance is evaluated using the F1-measure and a cost measure. In these experiments, the proposed scheme performs better than the existing term weighting schemes.
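To make the role of a term weighting scheme concrete, here is a minimal sketch of the conventional TF-IDF baseline, tf * log(N / df). It is not the MIDF formula, which the abstract does not state. The function name `tfidf_weights` and the toy documents are assumptions made purely for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: this is the conventional tf * log(N / df) baseline,
    shown only for illustration. It is NOT the MIDF scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical toy corpus, already tokenized.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "oil"],
    ["wheat", "export"],
]
weights = tfidf_weights(docs)
```

Each resulting dictionary is a sparse document vector that could be passed to a classifier such as an SVM. A term that occurs in every document receives weight zero under this baseline, which is one of the properties alternative weighting schemes try to refine.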
