Maximum Distance Minimum Error (MDME): A non-parametric approach to feature selection for image-based high content screening data

Feature selection is a necessary preprocessing step in data analytics. Most distribution-based feature selection algorithms are parametric approaches that assume the data are normally distributed. Real-world data, however, often follow a log-normal rather than a normal distribution; this is especially common in biology, where latent factors frequently shape distribution patterns. Parametric approaches are poorly suited to such data. We propose the Maximum Distance Minimum Error (MDME) method, a non-parametric approach capable of handling both normal and log-normal data sets. The MDME method is based on the Kolmogorov-Smirnov test, which tests whether two samples are drawn from the same distribution without any normality assumption. We evaluate MDME on multiple datasets and demonstrate that it performs comparably to, and often better than, traditional parametric approaches.
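The general idea the abstract describes can be sketched as follows: score each feature by the two-sample Kolmogorov-Smirnov distance between its class-conditional distributions and keep the features with the largest distance. This is a minimal illustration of KS-based, non-parametric feature ranking, not the authors' exact MDME algorithm; the function name and the synthetic data are invented for the example.

```python
# Hedged sketch of KS-based feature ranking (not the authors' exact MDME method).
# A feature whose distribution differs strongly between the two classes gets a
# large KS statistic; because the KS test is non-parametric, this works equally
# well for normal and log-normal features.
import numpy as np
from scipy.stats import ks_2samp

def ks_feature_ranking(X, y):
    """Return feature indices sorted by decreasing two-sample KS distance
    between the two class-conditional distributions, plus the raw scores."""
    classes = np.unique(y)
    assert len(classes) == 2, "this sketch assumes a binary labeling"
    scores = []
    for j in range(X.shape[1]):
        stat, _ = ks_2samp(X[y == classes[0], j], X[y == classes[1], j])
        scores.append(stat)
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]  # most discriminative feature first
    return order, scores

# Synthetic demo: feature 0 is informative (log-normal, class-dependent
# location), feature 1 is pure noise shared by both classes.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.lognormal(0.0, 0.5, 200),
                    rng.lognormal(1.0, 0.5, 200)]),  # informative, log-normal
    rng.normal(0.0, 1.0, 400),                       # uninformative
])
y = np.array([0] * 200 + [1] * 200)
order, scores = ks_feature_ranking(X, y)
print(order[0])  # the informative feature should rank first
```

Note that the ranking relies only on empirical cumulative distribution functions, so no log transform or normality check is needed before scoring skewed features.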
