Improving the analysis of designed studies by combining statistical modelling with study design information

Background

In the life sciences, so-called designed studies are used to study complex biological systems. The data derived from these studies follow a study design aimed at generating relevant information while reducing unwanted variation (noise). Knowledge about the study design can be used to decompose the total data into data blocks that are associated with specific effects. Subsequent statistical analysis can be improved by this decomposition when the analysis is applied to selected combinations of effects.

Results

The benefit of this approach was demonstrated with an analysis that combines multivariate PLS (Partial Least Squares) regression with data decomposition from ANOVA (Analysis of Variance): ANOVA-PLS. As a case study, a nutritional intervention study on Apolipoprotein E3-Leiden (APOE3Leiden) transgenic mice is used to study the relation between liver lipidomics and a plasma inflammation marker, Serum Amyloid A. The performance of ANOVA-PLS was compared to that of PLS regression on the non-decomposed data with respect to the quality of the modelled relation, model reliability, and interpretability.

Conclusion

It was shown that ANOVA-PLS leads to a better statistical model that is more reliable and easier to interpret than standard PLS analysis. In the subsequent biological interpretation, more relevant metabolites were derived from the model. The concept of combining data decomposition with a subsequent statistical analysis, as in ANOVA-PLS, is however not limited to PLS regression in metabolomics but can be applied to many statistical methods and many different types of data.
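The idea of decomposing the data by design factor and then regressing on a selected combination of effect blocks can be illustrated with a minimal sketch. It is not the authors' implementation: the balanced two-factor design (a hypothetical `diet` factor of interest and a nuisance `batch` factor), the numbers, the function names, and the choice to keep the diet block plus residuals are all assumptions made for illustration, and a one-component PLS1 step stands in for a full PLS model.

```python
from collections import defaultdict
from math import sqrt

def column_means(rows):
    """Mean of every column of a list-of-rows matrix."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def effect_block(X, labels):
    """ANOVA main-effect block: level mean minus grand mean, repeated per row.
    Assumes a balanced design so that main effects are simple level means."""
    grand = column_means(X)
    by_level = defaultdict(list)
    for row, lab in zip(X, labels):
        by_level[lab].append(row)
    level_means = {lab: column_means(rows) for lab, rows in by_level.items()}
    return [[m - g for m, g in zip(level_means[lab], grand)] for lab in labels]

def pls1_one_component(X, y):
    """One-component PLS1 on centred X and y: weights w, scores t, slope b."""
    p = len(X[0])
    w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(row[j] * w[j] for j in range(p)) for row in X]
    b = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return w, t, b

# Made-up data: 8 samples, 3 variables, 2 diets x 2 batches x 2 replicates.
X = [
    [1.0, 2.0, 3.0], [1.2, 2.1, 2.9],   # diet A, batch 1
    [2.0, 2.5, 3.1], [2.1, 2.4, 3.0],   # diet A, batch 2
    [1.5, 3.0, 2.0], [1.6, 3.2, 2.1],   # diet B, batch 1
    [2.6, 3.4, 2.2], [2.4, 3.5, 2.0],   # diet B, batch 2
]
diet = ["A", "A", "A", "A", "B", "B", "B", "B"]
batch = [1, 1, 2, 2, 1, 1, 2, 2]
y = [0.5, 0.6, 0.55, 0.65, 1.4, 1.5, 1.45, 1.6]  # made-up response

# Decompose: X = grand mean + diet block + batch block + residual block.
grand = column_means(X)
X_diet = effect_block(X, diet)
X_batch = effect_block(X, batch)
X_res = [
    [x - g - d - b for x, g, d, b in zip(row, grand, drow, brow)]
    for row, drow, brow in zip(X, X_diet, X_batch)
]

# Select the blocks of interest (assumption: diet effect plus residuals),
# which leaves the batch variation out of the regression.
X_sel = [[d + r for d, r in zip(drow, rrow)] for drow, rrow in zip(X_diet, X_res)]

y_mean = sum(y) / len(y)
y_centred = [v - y_mean for v in y]
w, t, b = pls1_one_component(X_sel, y_centred)
print("weights:", w)
print("slope on first score:", b)
```

Because the design in this sketch is balanced, the batch-level means of `X_sel` are zero, so the regression cannot draw on batch differences. With unbalanced data the effect blocks would have to be estimated differently.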
