Simplivariate Models: Uncovering the Underlying Biology in Functional Genomics Data

Exploratory analysis is one of the first steps in analyzing high-dimensional functional genomics data, and cluster analysis and principal component analysis are usually the methods of choice. Despite their versatility, these methods have a severe drawback: they do not always generate simple and interpretable solutions. Starting from the observation that functional genomics data often contain both informative and non-informative variation, we propose a method that finds sets of variables containing informative variation. This informative variation is subsequently expressed in easily interpretable simplivariate components. We present a new implementation of the recently introduced simplivariate models. In this implementation, the informative variation is described by multiplicative models that can adequately represent the relations within functional genomics data. The method performs well on one simulated and two real-life metabolomics data sets.
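To make the idea of a multiplicative description concrete, the following is a minimal sketch and not the implementation presented in the paper. It assumes that a multiplicative model for a set of variables can be illustrated by a rank-one least-squares fit a_i * b_j obtained from the singular value decomposition of the column-centered block. The function name `rank_one_fit`, the simulated data, and all parameter values are hypothetical. How the method searches for informative sets of variables is not shown here.

```python
import numpy as np

def rank_one_fit(X_sub):
    """Rank-one (multiplicative) least-squares fit to a block of variables.

    Illustrative assumption: the block is column-centered and then
    approximated by a_i * b_j using its leading singular triplet.
    Returns the fitted block and the fraction of variation it explains.
    """
    Xc = X_sub - X_sub.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    fitted = s[0] * np.outer(U[:, 0], Vt[0])
    explained = s[0] ** 2 / np.sum(s ** 2)
    return fitted, explained

# Hypothetical simulated data: 40 samples, 30 variables, of which the
# first 8 share one multiplicative pattern on top of unit-variance noise.
rng = np.random.default_rng(0)
n, p, k = 40, 30, 8
X = rng.normal(size=(n, p))              # non-informative variation
a = rng.normal(size=n)                   # sample scores
b = rng.uniform(1.0, 2.0, size=k)        # variable loadings
X[:, :k] += 3.0 * np.outer(a, b)         # informative variation

fitted, expl_informative = rank_one_fit(X[:, :k])
_, expl_noise = rank_one_fit(X[:, k:2 * k])

print(f"informative set: {expl_informative:.2f} of variation explained")
print(f"noise-only set:  {expl_noise:.2f} of variation explained")
```

In this sketch, a set of variables that shares a multiplicative pattern is explained far better by a single component than a set of pure-noise variables of the same size, which is the contrast between informative and non-informative variation that the abstract refers to.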
