A Novel Support Vector Machine-Based Approach for Rare Variant Detection

Advances in next-generation sequencing technologies have enabled the identification of multiple rare single nucleotide polymorphisms involved in diseases or traits. Several strategies for identifying rare variants that contribute to disease susceptibility have recently been proposed. An important feature of many of these statistical methods is the pooling or collapsing of multiple rare single nucleotide variants to achieve a reasonably high frequency and effect. However, if the pooled rare variants are associated with the trait in different directions, the pooling may weaken the signal and thereby reduce the statistical power of the test. In this paper, we propose a backward support vector machine (BSVM)-based variant selection procedure to identify informative disease-associated rare variants. In the selection procedure, rare variants are weighted and collapsed according to whether they are positively or negatively associated with the disease, which may itself be associated with both common variants and rare variants with protective, deleterious, or neutral effects. This nonparametric variant selection procedure can account for confounding factors and can also be adopted in other regression frameworks. The results of a simulation study and a data example show that the proposed BSVM approach is more powerful than four other approaches under the considered scenarios, while maintaining valid type I error rates.
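The cancellation problem described above can be illustrated with a small toy example. The sketch below is not the BSVM procedure itself; it only contrasts a naive collapsed burden score with a direction-aware one. The data, the function names (`variant_directions`, `burden`, `mean_burden_gap`), and the simple rule for assigning a direction (comparing mean allele counts in cases and controls) are all assumptions made for illustration.

```python
from statistics import mean

def variant_directions(genotypes, status):
    """Assign +1, -1, or 0 to each variant by comparing its mean
    minor-allele count in cases (status 1) and controls (status 0)."""
    cases = [g for g, y in zip(genotypes, status) if y == 1]
    controls = [g for g, y in zip(genotypes, status) if y == 0]
    directions = []
    for j in range(len(genotypes[0])):
        diff = mean(g[j] for g in cases) - mean(g[j] for g in controls)
        directions.append(1 if diff > 0 else -1 if diff < 0 else 0)
    return directions

def burden(genotype, weights):
    """Collapse one subject's variants into a single weighted score."""
    return sum(w * a for w, a in zip(weights, genotype))

def mean_burden_gap(genotypes, status, weights):
    """Difference in mean collapsed score between cases and controls."""
    case_scores = [burden(g, weights) for g, y in zip(genotypes, status) if y == 1]
    control_scores = [burden(g, weights) for g, y in zip(genotypes, status) if y == 0]
    return mean(case_scores) - mean(control_scores)

# Hypothetical toy data: rows are subjects, columns are three rare variants.
# Variant 0 appears only in cases, variant 1 only in controls,
# and variant 2 is equally frequent in both groups.
genotypes = [
    [1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0],  # cases
    [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0],  # controls
]
status = [1, 1, 1, 1, 0, 0, 0, 0]

naive_weights = [1, 1, 1]                         # pool everything together
signed_weights = variant_directions(genotypes, status)

naive_gap = mean_burden_gap(genotypes, status, naive_weights)
signed_gap = mean_burden_gap(genotypes, status, signed_weights)

print("directions:", signed_weights)
print("naive gap:", naive_gap)
print("signed gap:", signed_gap)
```

In this toy data the deleterious and protective variants cancel under naive pooling, while the signed score separates the two groups. Because the directions here are estimated from the same data that the gap is computed on, a real analysis would need a permutation test or similar correction to keep the type I error rate valid.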
