Feature Selection for High-Dimensional Data: A Kolmogorov-Smirnov Correlation-Based Filter

An algorithm for filtering information based on the Kolmogorov-Smirnov correlation-based approach has been implemented and tested on feature selection. The only parameter of this algorithm is the statistical confidence level that two distributions are identical. Empirical comparisons with four other state-of-the-art feature selection algorithms (FCBF, CorrSF, ReliefF and ConnSF) are very encouraging.
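The abstract does not spell out the procedure, so the following is only a minimal sketch of how a Kolmogorov-Smirnov redundancy filter with a single significance parameter might look, not the authors' implementation. It assumes that features arrive already ranked by relevance and on a comparable scale, that a feature is treated as redundant when a two-sample KS test cannot distinguish it from an already kept feature, and that the asymptotic Kolmogorov series is an acceptable approximation of the p-value. The names `ks_statistic`, `ks_pvalue`, `ks_filter` and `alpha` are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in points
    )


def ks_pvalue(d, n, m):
    """Asymptotic probability of a gap >= d when both samples share one distribution."""
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.1:
        # The alternating series converges slowly here; the true value is ~1.
        return 1.0
    total = 0.0
    for j in range(1, 101):
        term = 2.0 * (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-10:
            break
    return min(max(total, 0.0), 1.0)


def ks_filter(features, ranked_names, alpha=0.05):
    """Assumed rule: drop a feature if it looks identically distributed to a kept one.

    features: dict mapping feature name -> list of numeric values
    ranked_names: feature names ordered from most to least relevant
    alpha: significance threshold (the single tunable parameter in this sketch)
    """
    kept = []
    for name in ranked_names:
        redundant = False
        for kept_name in kept:
            d = ks_statistic(features[name], features[kept_name])
            p = ks_pvalue(d, len(features[name]), len(features[kept_name]))
            if p > alpha:  # cannot reject "same distribution"
                redundant = True
                break
        if not redundant:
            kept.append(name)
    return kept
```

Calling `ks_filter(features, ranked_names, alpha=0.05)` returns the names that survive the filter; in this sketch, raising `alpha` makes it harder for a pair to count as identical, so fewer features are dropped.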
