Data preprocessing in predictive data mining

Many issues influence the success of data mining on a given problem. Two of the most important are the representation and the quality of the dataset. Specifically, if the data contain much redundant, irrelevant, noisy, or unreliable information, knowledge discovery becomes very difficult. It is well known that data preparation accounts for a significant share of the processing time in machine learning tasks. It would be helpful if preprocessing algorithms performed equally reliably and effectively across all datasets, but this is impossible. We therefore present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.
