Data Preprocessing for Supervised Learning

Many factors affect the success of Machine Learning (ML) on a given task, and the representation and quality of the instance data are first and foremost. If the data contain much irrelevant or redundant information, or are noisy and unreliable, knowledge discovery during the training phase becomes more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc.; its product is the final training set. It would be convenient if a single sequence of pre-processing algorithms performed best on every data set, but this is not the case. Thus, we present the best-known algorithms for each step of data pre-processing, so that practitioners can achieve the best performance on their own data set.

Keywords—Data mining, feature selection, data cleaning.
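The pipeline described above (cleaning, then normalization, then feature selection, yielding the final training set) can be sketched in plain Python. This is a minimal illustration under assumed choices, not a method from the paper: mean imputation for cleaning, min-max scaling for normalization, and a variance threshold for feature selection are just one common instantiation of each step.

```python
def mean_impute(column):
    """Data cleaning: replace missing values (None) with the column mean."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def min_max_normalize(column):
    """Normalization: rescale values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def select_by_variance(columns, threshold=0.01):
    """Feature selection: keep columns whose variance exceeds a threshold."""
    kept = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        if var > threshold:
            kept.append(col)
    return kept

# Each inner list is one feature (column); None marks a missing value.
raw = [
    [1.0, None, 3.0, 5.0],   # informative feature with a gap
    [2.0, 2.0, 2.0, 2.0],    # constant feature, removed by selection
]
cleaned = [mean_impute(c) for c in raw]
normalized = [min_max_normalize(c) for c in cleaned]
final_training_set = select_by_variance(normalized)
print(len(final_training_set))  # the constant column has been dropped
```

The point of the sketch is only the ordering: cleaning must precede normalization (statistics computed over missing values are meaningless), and selection operates on the already-scaled features; each step could be swapped for any of the alternatives surveyed in the paper.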
