Statistical Inference Relief (STIR) feature selection

Motivation: Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.

Methods: We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.

Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR’s straightforward extension to genome-wide association studies.

Availability: Code and data available at http://insilico.utulsa.edu/software/STIR.

Contact: brett.mckinney@gmail.com
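
The pseudo t-test idea described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation (that is at the URL above). The sketch assumes a fixed-k neighbor constructor, a Manhattan metric on range-normalized attributes, and a pooled-variance two-sample statistic comparing nearest-miss differences against nearest-hit differences. The names stir_like_scores and toy_data, the pooled form of the standard error, and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean, variance


def stir_like_scores(X, y, k=3):
    """Return one pseudo t-statistic per attribute (illustrative sketch only)."""
    n, p = len(X), len(X[0])

    # Range-normalize each attribute so that differences are comparable.
    lo = [min(row[a] for row in X) for a in range(p)]
    hi = [max(row[a] for row in X) for a in range(p)]
    Z = [[(row[a] - lo[a]) / ((hi[a] - lo[a]) or 1.0) for a in range(p)]
         for row in X]

    def dist(i, j):
        # Manhattan distance over all attributes (an assumed metric).
        return sum(abs(u - v) for u, v in zip(Z[i], Z[j]))

    hit_diffs = [[] for _ in range(p)]   # differences to same-class neighbors
    miss_diffs = [[] for _ in range(p)]  # differences to other-class neighbors
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(i, j))
        hits = [j for j in others if y[j] == y[i]][:k]
        misses = [j for j in others if y[j] != y[i]][:k]
        for a in range(p):
            hit_diffs[a].extend(abs(Z[i][a] - Z[j][a]) for j in hits)
            miss_diffs[a].extend(abs(Z[i][a] - Z[j][a]) for j in misses)

    scores = []
    for a in range(p):
        h, m = hit_diffs[a], miss_diffs[a]
        nh, nm = len(h), len(m)
        # Pooled sample variance of hit and miss differences (assumed form).
        pooled = ((nh - 1) * variance(h) + (nm - 1) * variance(m)) / (nh + nm - 2)
        se = math.sqrt(pooled * (1.0 / nh + 1.0 / nm))
        scores.append((mean(m) - mean(h)) / se if se > 0 else 0.0)
    return scores


def toy_data(n_per_class=20, n_noise=3, shift=4.0, seed=0):
    """Simulate case-control data: attribute 0 has a main effect, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [label * shift + rng.gauss(0, 1)]
            row += [rng.gauss(0, 1) for _ in range(n_noise)]
            X.append(row)
            y.append(label)
    return X, y


X, y = toy_data()
scores = stir_like_scores(X, y, k=3)
for a, t in enumerate(scores):
    print(f"attribute {a}: pseudo t = {t:.2f}")
```

In this toy setting the attribute carrying the main effect should receive the largest statistic. Converting each statistic to a p-value (for example against a t distribution) and adjusting for multiple testing would follow as separate steps, which the sketch leaves out.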
