Sample imbalance disease classification model based on association rule feature selection

Abstract In computer-aided diagnosis research, the curse of dimensionality in disease features and the imbalance of medical samples have long been central problems for diagnostic decision support systems. To address these two problems, we propose a feature selection algorithm based on association rules and an ensemble classification algorithm based on random balanced sampling. We extracted and cleaned electronic medical record text obtained from a hospital to build a diabetes data set. The proposed algorithms were validated on this data set and on the public UCI data set. Experimental results show that the feature selection algorithm based on association rules outperforms the CART, ReliefF, and RFE-SVM algorithms in terms of feature dimension and classification accuracy. The proposed ensemble classification algorithm based on random balanced sampling outperforms the SMOTE-Boost and SMOTE-RF baselines in macro precision, macro recall, and macro F1, which indicates the robustness of the algorithm.
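The macro-averaged metrics named in the abstract can be illustrated with a short sketch. This is not the authors' evaluation code: the function name `macro_scores` and the toy labels are assumptions made for illustration, and macro F1 is taken here as the unweighted mean of per-class F1 scores (some papers instead compute it from macro precision and macro recall).

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) for two label lists.

    Hypothetical helper: each class contributes equally to the average,
    regardless of how many samples it has.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy imbalanced example (assumed data): 8 majority samples, 2 minority samples,
# with one minority sample misclassified as the majority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(macro_p, macro_r, macro_f1)
```

In this toy case plain accuracy is 0.9, while macro recall is 0.75 because the minority class recall of 0.5 counts as much as the majority class recall of 1.0. That sensitivity to minority-class errors is why macro-averaged scores are a common choice when classes are imbalanced.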
