An adaptive threshold determination method of feature screening for genomic selection

Background: Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and is important for predicting that trait. Selecting these influential SNPs efficiently and accurately from millions of candidates is in high demand but remains challenging. We propose a backward elimination iterative distance correlation (BE-IDC) procedure that selects the smallest subset of SNPs guaranteeing sufficient prediction accuracy, while also resolving the unclear-threshold issue of traditional feature screening approaches.

Results: In six simulations, the adaptive threshold estimated by BE-IDC performed uniformly better than the fixed-threshold methods used in the current literature. We also applied BE-IDC to a genome-wide Arabidopsis thaliana data set. Out of 216,130 SNPs, BE-IDC selected four influential SNPs and confirmed the same FRIGIDA gene that was reported by two other traditional methods.

Conclusions: BE-IDC delivers both the prediction accuracy and the computational speed that are in high demand in genomic selection.
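For readers unfamiliar with the dependence measure named in BE-IDC, the following is a minimal sketch of marginal screening by sample distance correlation. It is not the BE-IDC procedure itself: the backward elimination step and the adaptive threshold are not specified in the abstract and are not shown here. The function names (`distance_correlation`, `rank_snps`) and the 0/1/2 genotype coding are assumptions made for illustration only.

```python
import numpy as np


def _centered_distances(v):
    """Pairwise Euclidean distance matrix of v, double-centered."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # treat 1-D input as n samples of dimension 1
    diff = v[:, None, :] - v[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (value in [0, 1])."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()    # squared sample distance covariance
    dvar_x = (a * a).mean()   # squared sample distance variance of x
    dvar_y = (b * b).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0            # a constant variable carries no dependence
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def rank_snps(genotypes, trait):
    """Rank SNP columns by distance correlation with the trait.

    genotypes: (n_samples, n_snps) array, assumed coded 0/1/2.
    Returns (order, scores), where order lists column indices from the
    strongest to the weakest marginal dependence.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    scores = np.array([
        distance_correlation(genotypes[:, j], trait)
        for j in range(genotypes.shape[1])
    ])
    order = np.argsort(-scores)
    return order, scores
```

A fixed-threshold screen would keep the first d entries of `order` for some d chosen in advance; the abstract's point is that BE-IDC instead estimates the cutoff adaptively. Each call builds two n-by-n distance matrices, so this sketch suits modest sample sizes.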
