Adaptive pairing of classifier and imputation methods based on the characteristics of missing values in data sets

Selection of the optimal combination of imputation method and classifier is very costly. A novel method for automatic, adaptive selection of the optimal combination, AMCI, is proposed. The superiority of the proposed method is demonstrated on multiple data sets. The results also suggest that AMCI is scalable, making it suitable for big data analytics and IoT applications.

Classifiers and imputation methods play crucial roles in big data analytics. In particular, for data sets characterized by horizontal scattering, vertical scattering, level of spread, compound metric, imbalance ratio, and missing ratio, the way classifiers and imputation methods are combined leads to significantly different performance. It is therefore essential to identify the characteristics of a data set in advance in order to select the optimal combination of imputation method and classifier. However, this is a very costly process. This paper proposes a novel method for automatic, adaptive selection of the optimal combination of classifier and imputation method on the basis of the features of a given data set. In performance evaluations with multiple data sets, the proposed method demonstrated superior performance. Decision makers in big data analytics could benefit greatly from the proposed method when dealing with data sets in which the distribution of missing data varies in real time.
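As an illustration of two of the data set characteristics named above, the following minimal sketch computes a missing ratio and an imbalance ratio. It assumes common textbook definitions (fraction of missing cells, and majority class count divided by minority class count), which may differ from the exact definitions used in the paper. The function names `missing_ratio` and `imbalance_ratio` and the sample data are hypothetical, and the sketch is not the AMCI method itself.

```python
import math
from collections import Counter


def _is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_ratio(rows):
    """Fraction of all cells in the data set that are missing."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    missing = sum(1 for row in rows for value in row if _is_missing(value))
    return missing / total


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy data: 4 records with 3 attributes each, 3 missing cells.
rows = [
    [1.0, None, 3.0],
    [4.0, 5.0, float("nan")],
    [7.0, 8.0, 9.0],
    [None, 1.0, 2.0],
]
labels = ["a", "a", "a", "b"]

print(missing_ratio(rows))      # 3 missing cells out of 12
print(imbalance_ratio(labels))  # 3 records of class "a" versus 1 of class "b"
```

Features of this kind could serve as inputs to a selection step that picks an imputation method and classifier pairing for a given data set; how AMCI performs that step is not specified in this abstract.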
