Selecting critical features for data classification based on machine learning methods

Feature selection is especially important for data sets with many variables and features: it eliminates unimportant variables and improves both the accuracy and the performance of classification. Random Forest has emerged as a particularly useful algorithm for feature selection, even when the number of variables is large. In this paper, we conduct experiments on three popular datasets with many variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones). Feature selection is essential for four main reasons: it simplifies the model by reducing the number of parameters, decreases training time, reduces overfitting by enhancing generalization, and helps avoid the curse of dimensionality. We evaluate and compare the accuracy and performance of several classification models, namely Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken as the best classifier. In practice, this paper adopts Random Forest to select the important features for classification, and our experiments provide a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results obtained with and without feature selection using the RF-based methods varImp(), Boruta, and Recursive Feature Elimination (RFE), measured by percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves the best performance in all experiment groups.
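For illustration only, the following is a minimal R sketch of the kind of pipeline the abstract describes, assuming the caret, Boruta, and randomForest packages; the data frame dat and its label column Class are placeholders, not the paper's actual datasets or code.

## Minimal sketch (assumed, not the authors' code): rank features with three
## RF-based selectors, then refit several classifiers on the reduced data.
library(caret)         # train(), varImp(), rfe(), trainControl()
library(Boruta)        # all-relevant feature selection wrapper around RF
library(randomForest)  # RF engine behind caret's method = "rf"

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

## 'dat' is a placeholder data frame whose factor column 'Class' is the label.

## 1) varImp(): importance scores from a cross-validated Random Forest
rf_fit <- train(Class ~ ., data = dat, method = "rf", trControl = ctrl)
print(varImp(rf_fit))

## 2) Boruta: keep only attributes confirmed as important
bor  <- Boruta(Class ~ ., data = dat, doTrace = 0)
keep <- getSelectedAttributes(bor, withTentative = FALSE)

## 3) RFE: recursive feature elimination with RF as the base learner
x <- dat[, setdiff(names(dat), "Class")]
rfe_fit <- rfe(x, dat$Class, sizes = c(5, 10, 15),
               rfeControl = rfeControl(functions = rfFuncs,
                                       method = "cv", number = 10))
print(predictors(rfe_fit))

## Refit RF, SVM, KNN, and LDA on the Boruta-reduced data and compare
## Accuracy and Kappa (svmRadial and lda need the kernlab and MASS packages).
reduced <- dat[, c(keep, "Class")]
for (m in c("rf", "svmRadial", "knn", "lda")) {
  fit  <- train(Class ~ ., data = reduced, method = m, trControl = ctrl)
  best <- fit$results[which.max(fit$results$Accuracy), ]
  cat(sprintf("%-9s Accuracy: %.3f  Kappa: %.3f\n", m, best$Accuracy, best$Kappa))
}

In this sketch, caret's fit$results table already reports both Accuracy and Kappa for each resampling configuration, which mirrors the two metrics compared in the abstract.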
