Enhancing Binary Classification by Modeling Uncertain Boundary in Three-Way Decisions

Text classification assigns documents to predefined categories using classifiers learned from labelled or unlabelled training samples. Many researchers working on binary text classification seek more effective ways to separate relevant texts from a large data set. However, current text classifiers cannot unambiguously describe the decision boundary between positive and negative objects because of uncertainties introduced by text feature selection and the knowledge learning process. This paper proposes a three-way decision model, based on rough set techniques and a centroid solution, that deals with the uncertain boundary to improve binary text classification performance. It aims to understand the uncertain boundary by partitioning the training samples into three regions (the positive, boundary, and negative regions) using two main boundary vectors, $\vec{C_{P}}$ and $\vec{C_{N}}$, created from the labelled positive and negative training subsets, respectively, and then resolves the objects in the boundary region using two derived boundary vectors, $\vec{B_{P}}$ and $\vec{B_{N}}$, produced according to the structure of the boundary region. The model follows an indirect strategy composed of two successive steps across the whole classification process: ‘two-way to three-way’ and ‘three-way to two-way’.
Four decision rules are derived from the training process and applied to incoming documents for more precise classification. Extensive experiments were conducted on the standard data sets RCV1 and Reuters-21578. The experimental results show that using boundary vectors is effective and efficient for dealing with uncertainties of the decision boundary, and that the proposed model significantly improves binary text classification performance in terms of the $F_{1}$ measure and $AUC$ compared with six other popular baseline models.
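
The ‘two-way to three-way’ step can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not define the decision rules or how $\vec{B_{P}}$ and $\vec{B_{N}}$ are built, so the sketch assumes a plain centroid classifier in which a document goes to the boundary region when its cosine similarities to the two centroids differ by less than a threshold. The function names and the `margin` parameter are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def three_way_partition(X, y, margin=0.1):
    """Split training documents into positive, boundary, and negative regions.

    X: (n_docs, n_terms) term-weight matrix; y: 1 = positive, 0 = negative.
    margin is an assumed threshold, not a value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    c_p = unit(X[y == 1].mean(axis=0))  # centroid of the positive subset
    c_n = unit(X[y == 0].mean(axis=0))  # centroid of the negative subset

    # Normalise rows so that dot products are cosine similarities.
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0
    Xn = X / norms[:, None]

    # Assumed rule: a clear preference for one centroid decides the region;
    # anything within the margin is left in the boundary region.
    diff = Xn @ c_p - Xn @ c_n
    regions = np.where(diff > margin, "POS",
                       np.where(diff < -margin, "NEG", "BND"))
    return c_p, c_n, regions

if __name__ == "__main__":
    X = [[1, 0], [1, 1], [0, 1]]
    y = [1, 1, 0]
    _, _, regions = three_way_partition(X, y, margin=0.3)
    print(regions.tolist())
```

Under this assumed rule, a larger `margin` sends more documents to the boundary region, which is where a second, ‘three-way to two-way’ stage would take over.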
