Diverse training dataset generation based on multi-objective optimization for semi-supervised classification

Abstract The self-labeled technique is a type of semi-supervised classification that can be used when labeled data are scarce. Although existing self-labeled techniques show promise in many areas of classification and pattern recognition, they commonly mislabel data. This problem stems from the shortage of labeled data and from an inappropriate distribution of the data in the problem space. To address it, this paper proposes a synthetic labeled-data generation method based on accuracy and density. The positions of the generated data are refined by a multi-objective evolutionary algorithm with two objectives: accuracy and density. The density function generates data with an appropriate distribution and diversity in the feature space, whereas the accuracy function eliminates outliers. The advantage of the proposed method over existing ones is that it simultaneously considers the accuracy and the distribution of the generated data in the feature space. We applied the proposed method to four self-labeled techniques with different characteristics, i.e., Democratic-co, Tri-training, Co-forest, and Co-bagging. The results show that the proposed method is superior to existing methods in terms of classification accuracy. Its superiority is also confirmed over other data generation methods such as SMOTE, Borderline-SMOTE, Safe-Level-SMOTE, and SMOTE-RSB.
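To make the two objectives concrete, the sketch below shows one plausible way to score a candidate synthetic point: a density objective that rewards spreading synthetic data away from existing same-class points (diversity), and an accuracy objective that uses k-nearest-neighbour label agreement to reject outliers. This is a minimal illustration under our own assumptions; the function names, the k-NN formulation, and the toy data are not taken from the paper.

```python
import math

def density_objective(candidate, same_class_points, k=3):
    """Diversity proxy: mean distance to the k nearest same-class points.
    Larger values push synthetic data into sparser regions of feature space."""
    dists = sorted(math.dist(candidate, p) for p in same_class_points)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

def accuracy_objective(candidate, label, labeled_data, k=3):
    """Outlier filter: fraction of the k nearest labeled neighbours that
    share the candidate's proposed label (k-NN agreement score)."""
    neighbours = sorted(labeled_data, key=lambda pl: math.dist(candidate, pl[0]))[:k]
    return sum(1 for _, lab in neighbours if lab == label) / k

# Toy labeled set: two points of class 0, two of class 1.
labeled = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((1.0, 1.0), 1), ((0.9, 1.2), 1)]
cand = (0.1, 0.05)

# A multi-objective optimizer (e.g. NSGA-II) would trade these off per candidate.
print(accuracy_objective(cand, 0, labeled))
print(density_objective(cand, [p for p, lab in labeled if lab == 0]))
```

In a full implementation, an evolutionary algorithm would evolve a population of such candidates, keeping the Pareto front that balances high neighbour agreement (accuracy) against placement in under-covered regions (density).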

[1]  Di Wu,et al.  A Highly Accurate Framework for Self-Labeled Semisupervised Classification in Industrial Applications , 2018, IEEE Transactions on Industrial Informatics.

[2]  Francisco Herrera,et al.  SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory , 2012, Knowledge and Information Systems.

[3]  Zahir Tari,et al.  SemTra: A semi-supervised approach to traffic flow labeling with minimal human effort , 2019, Pattern Recognit..

[4]  Jianjun Li A two-step rejection procedure for testing multiple hypotheses , 2008 .

[5]  Yongquan Zhou,et al.  Twin support vector machines: A survey , 2018, Neurocomputing.

[6]  Korris Fu-Lai Chung,et al.  Semi-supervised classification method through oversampling and common hidden space , 2016, Inf. Sci..

[7]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[8]  Yizong Cheng,et al.  Mean Shift, Mode Seeking, and Clustering , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Xiaojin Zhu,et al.  Introduction to Semi-Supervised Learning , 2009, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[10]  Francisco Herrera,et al.  A MapReduce Approach to Address Big Data Classification Problems Based on the Fusion of Linguistic Fuzzy Rules , 2015, Int. J. Comput. Intell. Syst..

[11]  Shahrokh Asadi,et al.  Development of a Reinforcement Learning-based Evolutionary Fuzzy Rule-Based System for diabetes diagnosis , 2017, Comput. Biol. Medicine.

[12]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[13]  David J. Sheskin,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .

[14]  Friedhelm Schwenker,et al.  Combining Committee-Based Semi-Supervised Learning and Active Learning , 2010, Journal of Computer Science and Technology.

[15]  Esmaeil Hadavandi,et al.  Hybridization of evolutionary Levenberg-Marquardt neural networks and data pre-processing for stock market prediction , 2012, Knowl. Based Syst..

[16]  Shahrokh Asadi,et al.  MEMOD: a novel multivariate evolutionary multi-objective discretization , 2017, Soft Computing.

[17]  Guoyin Wang,et al.  Self-training semi-supervised classification based on density peaks of data , 2018, Neurocomputing.

[18]  Nicolás García-Pedrajas,et al.  Nonlinear Boosting Projections for Ensemble Construction , 2007, J. Mach. Learn. Res..

[19]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[20]  Rajeev Agrawal,et al.  2014 International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom) , 2014 .

[21]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[22]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[23]  Francisco Herrera,et al.  A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability , 2009, Soft Comput..

[24]  Xin Yao,et al.  Multiclass Imbalance Problems: Analysis and Potential Solutions , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[25]  Holger H. Hoos,et al.  A survey on semi-supervised learning , 2019, Machine Learning.

[26]  Shahrokh Asadi,et al.  EMDID: Evolutionary multi-objective discretization for imbalanced datasets , 2018, Inf. Sci..

[27]  Lior Rokach,et al.  Ensemble-based classifiers , 2010, Artificial Intelligence Review.

[28]  Weidong Hu,et al.  Diversity in Machine Learning , 2018, IEEE Access.

[29]  Francisco Herrera,et al.  SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary , 2018, J. Artif. Intell. Res..

[30]  Naresh Sharma,et al.  Radial Basis Neural Network for Availability Analysis , 2019 .

[31]  Zhi-Hua Zhou,et al.  CoTrade: Confident Co-Training With Data Editing , 2011, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[32]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[33]  Jesús Alcalá-Fdez,et al.  KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework , 2011, J. Multiple Valued Log. Soft Comput..

[34]  José Manuel Benítez,et al.  Self-labeling techniques for semi-supervised time series classification: an empirical study , 2018, Knowledge and Information Systems.

[35]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[36]  Shantanu,et al.  Data analysis using principal component analysis , 2014, 2014 International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom).

[37]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[38]  Jamal Shahrabi,et al.  Complexity-based parallel rule induction for multiclass classification , 2017, Inf. Sci..

[39]  Francisco Herrera,et al.  Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study , 2015, Knowledge and Information Systems.

[40]  Ian H. Witten,et al.  Data Mining: Practical Machine Learning Tools and Techniques , 2014 .

[41]  S. Holm A Simple Sequentially Rejective Multiple Test Procedure , 1979 .

[42]  Francisco Herrera,et al.  SEG-SSC: A Framework Based on Synthetic Examples Generation for Self-Labeled Semi-Supervised Classification , 2015, IEEE Transactions on Cybernetics.

[43]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[44]  Avrim Blum,et al.  Learning from Labeled and Unlabeled Data using Graph Mincuts , 2001, ICML.

[45]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[46]  Yan Zhou,et al.  Democratic co-learning , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[47]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[48]  H. Finner On a Monotonicity Problem in Step-Down Multiple Test Procedures , 1993 .

[49]  Zhi-Hua Zhou,et al.  Semi-supervised learning by disagreement , 2010, Knowledge and Information Systems.

[50]  Avrim Blum,et al.  Combining Labeled and Unlabeled Data with Co-Training , 1998, COLT.

[51]  Zhi-Hua Zhou,et al.  SETRED: Self-training with Editing , 2005, PAKDD.

[52]  Zhi-Hua Zhou,et al.  Tri-training: exploiting unlabeled data using three classifiers , 2005, IEEE Transactions on Knowledge and Data Engineering.

[53]  Nong Sang,et al.  Using clustering analysis to improve semi-supervised classification , 2013, Neurocomputing.

[54]  Zhongsheng Hua,et al.  Semi-supervised learning based on nearest neighbor rule and cut edges , 2010, Knowl. Based Syst..

[55]  Zhi-Hua Zhou,et al.  Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples , 2007, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[56]  Sonia Garcia-Salicetti,et al.  From aging to early-stage Alzheimer's: Uncovering handwriting multimodal behaviors by semi-supervised learning and sequential representation learning , 2019, Pattern Recognit..

[57]  Zhi-Hua Zhou,et al.  Semi-supervised learning by disagreement , 2010, Knowledge and Information Systems.

[58]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[59]  Yide Wang,et al.  Progressive Semisupervised Learning of Multiple Classifiers , 2018, IEEE Transactions on Cybernetics.

[60]  Chao Deng,et al.  A new co-training-style random forest for computer aided diagnosis , 2011, Journal of Intelligent Information Systems.

[61]  Shih-Fu Chang,et al.  Semi-supervised learning using greedy max-cut , 2013, J. Mach. Learn. Res..

[62]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[63]  Gang Wang,et al.  Solution Path for Manifold Regularized Semisupervised Classification , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[64]  Shahrokh Asadi,et al.  An evolutionary deep belief network extreme learning-based for breast cancer diagnosis , 2019, Soft Comput..

[65]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[66]  Yuichiro Anzai,et al.  Pattern Recognition and Machine Learning , 1992, Springer US.

[67]  Adil M. Bagirov,et al.  Clustering in large data sets with the limited memory bundle method , 2018, Pattern Recognit..