Deciphering the complex: Methodological overview of statistical models to derive OMICS‐based biomarkers

Recent technological advances in molecular biology have given rise to numerous large‐scale datasets whose analysis poses serious methodological challenges, mainly relating to the size and complex structure of the data. Considerable experience in analyzing such data has been gained over the past decade, mainly in genetics during the Genome‐Wide Association Study era, and more recently in transcriptomics and metabolomics. Building upon the corresponding literature, we provide here a nontechnical overview of well‐established methods used to analyze OMICS data within three main types of regression‐based approaches: univariate models, including multiple testing correction strategies; dimension reduction techniques; and variable selection models. Our methodological description focuses on methods for which ready‐to‐use implementations are available. For each model, we describe the main underlying assumptions, the main features, and the advantages and limitations. This descriptive summary is intended as a tool to guide methodological choices when analyzing OMICS data, especially in environmental epidemiology, where the emergence of the exposome concept clearly calls for unified methods to analyze complex exposure and OMICS datasets, both marginally and jointly. Environ. Mol. Mutagen. 54:542‐557, 2013. © 2013 Wiley Periodicals, Inc.
