Megavariate analysis of hierarchical QSAR data

Multivariate PCA and PLS models involving many variables are often difficult to interpret, because plots and lists of loadings, coefficients, VIPs, etc., rapidly become cluttered and hard to survey. There may then be a strong temptation to eliminate variables to obtain a smaller data set. Such a reduction of variables, however, often removes information and makes the modelling less reliable: model interpretation may be misleading and predictive power may deteriorate.

A better alternative is usually to partition the variables into blocks of logically related variables and apply hierarchical data analysis. Each block may be analyzed separately by PCA or PLS. This modelling forms the base level of the hierarchical modelling set-up, where in-depth information is extracted for the different blocks. The score vectors formed on the base level, here called 'super variables', may then be linked together in new matrices on the top level. On the top level, the superficial (overall) relationships between the X- and the Y-data are investigated.

In this paper the basic principles of hierarchical modelling by means of PCA and PLS are reviewed. One objective of the paper is to disseminate this concept to a broader QSAR audience. The hierarchical methods are used to analyze a set of 10 haloalkanes for which K = 30 chemical descriptors and M = 255 biological responses have been gathered. Because of the complexity of the biological data, the responses are subdivided into four blocks. All the modelling steps on the base level and the top level are reported, and the final QSAR model is interpreted thoroughly.
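The two-level scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the data are random stand-ins with the stated dimensions (10 compounds, K = 30, M = 255), the equal-sized split into four response blocks, the choice of two components per block, and the helper names `pca_scores` and `first_pls_component` are all assumptions made for illustration. Only the first top-level PLS component is computed, using the dominant singular vector of X'Y.

```python
import numpy as np

def pca_scores(block, n_components):
    """Score vectors of a block after centring and unit-variance scaling."""
    centred = block - block.mean(axis=0)
    std = centred.std(axis=0, ddof=1)
    std[std == 0] = 1.0                      # guard against constant columns
    scaled = centred / std
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def first_pls_component(x, y):
    """First PLS weight vector w, X-score t and Y-loading c.

    w is the dominant left singular vector of x.T @ y, which is what the
    NIPALS iteration converges to for the first component.
    """
    u, _, _ = np.linalg.svd(x.T @ y, full_matrices=False)
    w = u[:, 0]
    t = x @ w
    c = y.T @ t / (t @ t)
    return w, t, c

# Synthetic stand-in data (NOT the haloalkane data set from the paper).
rng = np.random.default_rng(0)
n_compounds = 10
X = rng.normal(size=(n_compounds, 30))      # K = 30 chemical descriptors
Y = rng.normal(size=(n_compounds, 255))     # M = 255 biological responses

# Assumed blocking: four roughly equal response blocks.
y_blocks = np.array_split(Y, 4, axis=1)

# Base level: one PCA per block; the scores become the super variables.
super_y = np.hstack([pca_scores(b, 2) for b in y_blocks])   # shape (10, 8)
super_x = pca_scores(X, 2)                                   # shape (10, 2)

# Top level: relate the X super variables to the Y super variables.
w, t, c = first_pls_component(super_x, super_y)
print("top-level X weights:", w)
print("top-level Y loadings:", c)
```

With real data, the blocks would follow the logical grouping of the variables rather than an even split, and the number of components per block would normally be chosen by cross-validation rather than fixed at two.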
