Multivariate Statistical Analysis using the R package chemometrics

In multivariate data analysis we do not study a single variable, or the relation between two variables, in isolation; instead we consider several characteristics simultaneously. A statistical analysis of chemical data (also called chemometrics) has to take the special structure of this type of data into account. Classical model assumptions are often not fulfilled by chemical data: for instance, there may be a large number of variables but only a few observations, or the variables may be correlated. To avoid the problems that arise from this, classical methods have to be adapted for chemometrics and new ones developed. The statistical environment R is a powerful tool for data analysis and graphical representation. It is open-source software, so many individuals can contribute by improving the code and adding functions. One of these contributed packages, chemometrics, implemented by Kurt Varmuza and Peter Filzmoser, is designed especially for the multivariate analysis of chemical data and contains functions mostly for regression, classification and model evaluation. The work at hand is a vignette for this package and can be read as a manual for its functionality. The aim of this vignette is to explain the relevant methods and to demonstrate and compare them using practical examples.
