A novel framework for termset selection and weighting in binary text classification

This study presents a new framework for termset selection and weighting that uses the joint occurrence statistics of pairs of terms. Each termset is evaluated by considering both the simultaneous and the individual occurrences of its terms. Because the occurrence of one term without the other may also carry discriminative information, conventional term selection schemes are adapted for termset selection. Similarly, the weight of a selected termset is computed as a function of the terms that occur in the document under consideration: a termset is assigned a nonzero weight if either or both of its terms appear in the document. This weighting scheme evaluates the individual occurrences of the terms and their co-occurrences separately to compute the document-specific weight of each termset. The termset-based representation is concatenated with the bag-of-words representation to construct the document vectors. Experiments on three widely used datasets verify the effectiveness of the proposed framework.
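
The representation described above can be illustrated with a minimal sketch. It is not the paper's actual formulation: the abstract does not give the selection or weighting formulas, so the per-pattern weights below are hypothetical placeholders, and the function names (`occurrence_pattern`, `termset_weight`, `document_vector`) are introduced only for illustration. The sketch shows the two structural ideas stated in the abstract: a two-term termset receives a nonzero weight when either or both terms occur, and the termset features are appended to a bag-of-words vector.

```python
from collections import Counter


def occurrence_pattern(tokens, pair):
    """Classify how the two terms of a termset occur in a document."""
    first, second = pair
    has_first, has_second = first in tokens, second in tokens
    if has_first and has_second:
        return "both"
    if has_first:
        return "first_only"
    if has_second:
        return "second_only"
    return "none"


def termset_weight(tokens, pair, pattern_weights):
    """Return a document-specific termset weight.

    pattern_weights maps each occurrence pattern to a weight. These values
    are assumptions for illustration; the paper derives its own weights.
    The weight is zero only when neither term occurs.
    """
    pattern = occurrence_pattern(tokens, pair)
    return 0.0 if pattern == "none" else pattern_weights[pattern]


def document_vector(tokens, vocabulary, termsets, weights):
    """Concatenate bag-of-words counts with termset weights."""
    counts = Counter(tokens)
    bag_of_words = [float(counts[term]) for term in vocabulary]
    termset_part = [termset_weight(tokens, pair, weights[pair]) for pair in termsets]
    return bag_of_words + termset_part


# Hypothetical vocabulary, termset, and per-pattern weights.
vocabulary = ["oil", "price", "wheat"]
termsets = [("oil", "price")]
weights = {("oil", "price"): {"both": 1.0, "first_only": 0.5, "second_only": 0.25}}

vector = document_vector(["oil", "oil", "wheat"], vocabulary, termsets, weights)
print(vector)  # [2.0, 0.0, 1.0, 0.5]
```

Here the document contains "oil" but not "price", so the termset feature takes the hypothetical "first_only" weight rather than zero, which is the behaviour that distinguishes this scheme from a co-occurrence-only feature.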
