Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm)

This article describes how multivariate projection techniques, such as principal-component analysis (PCA) and partial least-squares projections to latent structures (PLS), can be applied to the large-volume, high-density data structures obtained in genomics, proteomics, and metabonomics. PCA and PLS, and their extensions, are useful because they can analyze data with many noisy, collinear, and even incomplete variables in both X and Y. Three examples serve as illustrations. The first is a genomics data set and involves modeling microarray data on cell cycle-regulated genes in the microorganism Saccharomyces cerevisiae. The second contains NMR-metabonomics data measured on urine samples from male rats treated with one of two drugs, chloroquine or amiodarone. The third and last describes sequence-function classification studies on a set of G-protein-coupled receptors using hierarchical PCA.
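As a rough illustration of why projection methods suit data with far more variables than observations, the following minimal sketch computes PCA scores and loadings through a singular value decomposition. The synthetic data (20 observations, 500 collinear variables driven by two latent factors), the function name `pca_scores`, and the choice of unit-variance scaling are all assumptions made for this example; none of them come from the article or its data sets.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD on mean-centered, unit-variance-scaled data.

    Returns scores T (observations x components), loadings P
    (variables x components), and the fraction of total variance
    captured by each retained component.
    """
    Xc = X - X.mean(axis=0)              # center each variable
    sd = Xc.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                    # guard against constant columns
    Xs = Xc / sd                         # unit-variance scaling (an assumption)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]
    P = Vt[:n_components].T
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return T, P, explained

# Hypothetical data: few observations, many collinear, noisy variables.
rng = np.random.default_rng(0)
n_obs, n_vars = 20, 500
latent = rng.normal(size=(n_obs, 2))     # two underlying factors
mixing = rng.normal(size=(2, n_vars))    # how each variable loads on them
X = latent @ mixing + 0.1 * rng.normal(size=(n_obs, n_vars))

T, P, explained = pca_scores(X, n_components=2)
print("scores shape:", T.shape)
print("loadings shape:", P.shape)
print("variance explained per component:", explained)
```

Because the 500 variables here are generated from only two latent factors, two components summarize most of the variation even though the variable count far exceeds the number of observations. Handling incomplete data, as mentioned in the abstract, would need an iterative algorithm such as NIPALS, which this sketch does not implement.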
