Kernel methods for large-scale genomic data analysis

Machine learning, and kernel methods in particular, has emerged as a promising tool for tackling the challenges posed by today's explosive growth of genomic data. Kernel methods provide a practical and principled approach to learning how large numbers of genetic variants are associated with complex phenotypes, helping to reveal the complexity of the relationship between genetic markers and the outcome of interest. In this review, we highlight the potentially key role that kernel methods will have in modern genomic data processing, especially in their integration with classical methods for gene prioritization, prediction, and data fusion.
