Computational Statistics Approaches to Study Metabolic Syndrome

In this chapter, we review key research problems and methods in the analysis of ‘omics’ data: gene expression, proteomics, metabolomics, and lipidomics. The common systems biology approach to studying metabolic syndrome, as with any other disease, is the comparative case-control setting. This setting is usually an oversimplification, because other covariates also affect the concentrations of molecules, for instance drug treatment, gender, body mass index (BMI), and time in time-series experiments. Once these covariates are taken into account, the setting becomes a multi-way experimental design. When multiple data sources are available, such as several ‘omics’ types, multiple tissues, or multiple species, each source forms a different data space with its own molecules or variables, which raises the problem of data integration. We begin with a brief tutorial on the commonly used univariate and multivariate statistical approaches that apply when the problem is simplified by stratification to a case-control design. We then focus on multi-way setups of the Analysis of Variance (ANOVA) type, and in particular on their main difficulty for ‘omics’ data: the large number of variables compared to the small number of observations. Finally, we introduce a recent family of Bayesian methods that can handle multi-way, multi-source data sets and translate biomarkers between multiple species. The approach copes with small sample sizes combined with high dimensionality, and it allows rigorous estimation of the uncertainty of the results.
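
To make the simplified case-control setting concrete, the following is a minimal sketch of a univariate analysis: one test per molecule, followed by a correction for the many variables tested. It assumes a permutation test on the difference in group means and Benjamini-Hochberg false discovery rate control. Both are illustrative choices rather than the chapter's prescribed procedure, and the function names `permutation_pvalue` and `benjamini_hochberg` are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_pvalue(cases, controls, n_perm=500, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(_mean(cases) - _mean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the case/control labels.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_cases]) - _mean(pooled[n_cases:]))
        if diff >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Simulated data (an assumption for illustration): 30 molecules,
    # 8 cases and 8 controls, with the first 3 molecules shifted in cases.
    rng = random.Random(1)
    pvals = []
    for j in range(30):
        shift = 4.0 if j < 3 else 0.0
        cases = [rng.gauss(shift, 1.0) for _ in range(8)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(8)]
        pvals.append(permutation_pvalue(cases, controls, seed=j))
    qvals = benjamini_hochberg(pvals)
    print([j for j, q in enumerate(qvals) if q < 0.05])
```

With far more molecules than samples, as is typical for ‘omics’ data, the unadjusted per-molecule p-values would produce many false positives, which is why the correction step is included in the sketch.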
