Applications of Machine Learning and Data Mining Methods to Detect Associations of Rare and Common Variants with Complex Traits

Machine learning methods (MLMs), designed to develop models using high‐dimensional predictors, have been used to analyze genome‐wide genetic and genomic data to predict risks for complex traits. We summarize the results from six contributions to our Genetic Analysis Workshop 18 working group; these investigators applied MLMs and data mining to analyses of rare and common genetic variants measured in pedigrees. To develop risk profiles, group members analyzed blood pressure traits along with single‐nucleotide polymorphisms and rare variant genotypes derived from sequence and imputation analyses in large Mexican American pedigrees. Supervised MLMs included penalized regression with varying penalties, support vector machines, and permanental classification. Unsupervised MLMs included sparse principal components analysis and sparse graphical models. Entropy‐based components analyses were also used to mine these data. None of the investigators fully capitalized on the genetic information provided by the complete pedigrees. Their approaches either corrected for the nonindependence of the individuals within the pedigrees or analyzed only unrelated individuals. Some methods allowed for covariate adjustment, whereas others did not. We evaluated these methods using a variety of metrics. Four contributors conducted primary analyses on the real data, and the other two research groups used the simulated data with and without knowledge of the underlying simulation model. One group used the answers to the simulated data to assess power and type I error rates. Although the MLMs applied were substantially different, each research group concluded that MLMs have advantages over standard statistical approaches with these high‐dimensional data.
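
As an illustration of what "penalized regression with varying penalties" means in practice, the sketch below compares ridge, lasso, and elastic-net shrinkage in the simplest possible setting. It is not taken from any of the six contributions. It assumes a single standardized predictor, for which the penalized estimate has a closed form in terms of the ordinary least squares coefficient, and it assumes the common elastic-net parameterization with an overall penalty strength `lam` and a mixing weight `alpha` (1 for lasso, 0 for ridge). The function names `soft_threshold` and `penalized_coefficient` are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def penalized_coefficient(b_ols, lam, alpha):
    """Elastic-net estimate for one standardized predictor (assumed setting).

    Minimizes 0.5 * (b_ols - b)**2 + lam * (alpha * |b| + 0.5 * (1 - alpha) * b**2).
    alpha = 1 gives the lasso, alpha = 0 gives ridge regression.
    """
    return soft_threshold(b_ols, lam * alpha) / (1.0 + lam * (1.0 - alpha))


if __name__ == "__main__":
    ols_coefficients = (0.2, 1.0, -2.0)  # illustrative values, not real data
    lam = 0.5
    for name, alpha in (("ridge", 0.0), ("elastic net", 0.5), ("lasso", 1.0)):
        shrunk = [round(penalized_coefficient(b, lam, alpha), 3)
                  for b in ol_coefs] if False else [
                  round(penalized_coefficient(b, lam, alpha), 3)
                  for b in ols_coefficients]
        print(name, shrunk)
```

In this toy setting the lasso sets small coefficients exactly to zero, ridge shrinks every coefficient proportionally without removing any, and the elastic net does some of each. That difference in behavior is why the choice of penalty matters when the number of variants far exceeds the number of individuals.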
