Chapter 15 PLS in Data Mining and Data Integration

Data mining by means of projection methods such as PLS (projection to latent structures) and their extensions is discussed. The most common data analytical questions in data mining are covered and illustrated with examples:

(a) Clustering, i.e., finding and interpreting "natural" groups in the data
(b) Classification and identification, e.g., biologically active vs. inactive compounds
(c) Quantitative relationships between different sets of variables, e.g., finding variables related to the quality of a product, or related to temporal, seasonal, and/or geographical change

Sub-problems occurring in all of (a) to (c) are discussed:

(1) Identification of outliers and their aberrant data profiles
(2) Finding the dominating variables and their joint relationships
(3) Making predictions for new samples

The use of graphics for the contextual interpretation of results is emphasized. With many variables and few observations (samples), a common situation in data mining, the risk of obtaining spurious models is substantial. Spurious models look great for the training set data but give miserable predictions for new samples. Hence, validation of the data analytical results is essential, and approaches to it are discussed.
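The warning about spurious models can be illustrated with a small sketch. It is not taken from the chapter: the data are pure random noise, the sizes (20 samples, 300 variables), the choice of two components, the use of leave-one-out cross-validation, and the helper names fit_pls1, predict_pls1, and explained_fraction are all assumptions made for illustration. A single-response PLS model is fitted with NIPALS-style deflation, and the fit to the training set (R2) is compared with the cross-validated predictive ability (Q2).

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_components):
    """Fit a single-response PLS model by successive deflation (hypothetical helper)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X residuals
    f = [v - y_mean for v in y]                                # centered y residuals
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(E[i], w) for i in range(n)]                   # scores
        tt = dot(t, t)
        loading = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(f, t) / tt                                     # y loading
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * loading[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, loading, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    """Predict y for one sample by replaying the deflation steps."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, loading, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, loading)]
    return y_hat


def explained_fraction(y, y_hat):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_mean = sum(y) / len(y)
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    ss_res = sum((v - h) ** 2 for v, h in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


# Assumed toy setting: X and y are independent noise, so no real relationship exists.
rng = random.Random(0)
n_samples, n_variables, n_components = 20, 300, 2
X = [[rng.gauss(0, 1) for _ in range(n_variables)] for _ in range(n_samples)]
y = [rng.gauss(0, 1) for _ in range(n_samples)]

# R2: how well the model reproduces the training set it was fitted on.
model = fit_pls1(X, y, n_components)
r2 = explained_fraction(y, [predict_pls1(model, x) for x in X])

# Q2: each sample is predicted by a model fitted without it (leave-one-out).
loo_predictions = []
for i in range(n_samples):
    reduced = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
    loo_predictions.append(predict_pls1(reduced, X[i]))
q2 = explained_fraction(y, loo_predictions)

print(f"R2 (training fit)    = {r2:.2f}")
print(f"Q2 (cross-validated) = {q2:.2f}")
```

In this setting the training R2 should come out high even though y is unrelated to X, while Q2 should stay low or negative. A large gap between the two is the signature of a spurious model. Other validation schemes, such as an external test set or response permutation, could be used in place of leave-one-out.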
