POP: A Parallel Optimized Preparation of data for data mining

Because data preparation has a substantial impact on data mining results, we propose an original framework for automatically preparing the data of any given database. For each attribute of the database, our research focuses on two points: (i) specifying an optimized outlier detection method, and (ii) identifying the most appropriate discretization method. Concerning the former, we show that the way an outlier is detected depends on whether the data distribution is normal or not. When choosing the best discretization method, the deciding factor is the shape of the density function of the attribute's distribution. For this reason, we propose an automatic choice of the optimized discretization method based on a multi-criteria (Entropy, Variance, Stability) evaluation. Processing is performed in parallel using multicore capabilities. Experiments validate our approach and show that no single discretization method is always the best.
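The distribution-dependent outlier step can be illustrated with a minimal sketch. The abstract does not name the tests or rules the framework uses, so every choice here is an assumption made for illustration only: a Jarque-Bera statistic as the normality check, a three-standard-deviation rule for attributes that look normal, and Tukey's interquartile fences otherwise. The function names `jarque_bera`, `choose_outlier_rule`, and `detect_outliers` are hypothetical.

```python
import statistics

# 95% quantile of the chi-square distribution with 2 degrees of freedom,
# the asymptotic reference distribution of the Jarque-Bera statistic.
CHI2_CRIT_95 = 5.991


def jarque_bera(values):
    """Jarque-Bera statistic built from sample skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:  # constant attribute: nothing to test
        return 0.0
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def choose_outlier_rule(values):
    """Assumed decision: 'three_sigma' if normality is not rejected, else 'tukey'."""
    return "three_sigma" if jarque_bera(values) < CHI2_CRIT_95 else "tukey"


def detect_outliers(values):
    """Return the values flagged by the rule chosen for this attribute.

    Requires at least two values (needed by stdev and quantiles).
    """
    if choose_outlier_rule(values) == "three_sigma":
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        low, high = mean - 3 * sd, mean + 3 * sd
    else:
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 50]
symmetric = [-3, -2, -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
print(choose_outlier_rule(skewed), detect_outliers(skewed))
print(choose_outlier_rule(symmetric), detect_outliers(symmetric))
```

Because each attribute is handled independently, calls to a function like `detect_outliers` could be dispatched one per attribute to a process pool such as `concurrent.futures.ProcessPoolExecutor`. This is one plausible reading of "in parallel using multicore capabilities", not a description of the authors' implementation.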
