Self Organising Maps for variable selection: Application to human saliva analysed by nuclear magnetic resonance spectroscopy to investigate the effect of an oral healthcare product

Self Organising Maps (SOMs) originate in the machine learning literature and are a valuable method for representing data. This paper describes the use of SOMs to determine the most significant variables (or markers) in a dataset. The method is applied to the NMR spectra of 96 human saliva samples, half treated with an oral rinse formulation and half controls, each characterised by 49 variables consisting of bucketed intensities. In addition, three simulations are described: two have the same number of samples and variables as the experimental dataset, and the third contains a much larger number of variables. Two of the simulations contain known discriminatory variables; the remaining one is treated as a null dataset with no discriminatory variables added. The SOM method is compared with Partial Least Squares Discriminant Analysis: the markers judged most significant by each approach are listed, and the differences between the two lists are discussed. A SOM Discrimination Index (SOMDI) is defined, whose magnitude indicates how strongly a variable is considered to be a discriminator. To ensure that the model is stable and does not depend on the random starting point of the SOM, one hundred iterations were performed and the variables that consistently ranked highly were selected. A variety of approaches for data representation are illustrated, and the main theoretical principles of using SOMs to determine the most significant variables are outlined. The software used in this paper was written in-house, which allowed greater flexibility than existing packages and tailoring to the specific application.
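The idea of repeating the SOM from different random starting points and keeping only variables that rank highly throughout can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-house software or the SOMDI defined in the paper: the abstract does not give the SOMDI formula, so the score used here (`class_contrast`) is a hypothetical stand-in, and the function names, grid size, learning schedule, simulated marker positions, and the use of 20 runs rather than one hundred are all choices made for the example.

```python
import numpy as np

def simulate(n_samples=96, n_vars=49, markers=(3, 17), shift=4.0, seed=0):
    """Two equal classes; only the `markers` columns differ between them (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_vars))
    y = np.repeat([0, 1], n_samples // 2)
    for m in markers:
        X[y == 1, m] += shift
    return X - X.mean(axis=0), y  # mean-centre each variable

def train_som(X, grid=(6, 6), n_iter=1500, seed=0):
    """Basic online Kohonen SOM on a rectangular grid with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Random starting point: unit weights are initialised from randomly drawn samples.
    W = X[rng.choice(len(X), size=rows * cols, replace=True)].copy()
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (0.01 / 0.5) ** frac      # learning rate decays from 0.5 to 0.01
        sigma = 3.0 * (0.5 / 3.0) ** frac    # neighbourhood width decays from 3 to 0.5
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))      # best matching unit
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)   # squared grid distance to BMU
        h = np.exp(-d2 / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)
    return W

def class_contrast(W, X, y):
    """Hypothetical score (NOT the paper's SOMDI): per-variable absolute difference
    between the mean best-matching-unit weights of the two classes."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    mapped = W[d2.argmin(axis=1)]  # weight vector of each sample's BMU
    return np.abs(mapped[y == 0].mean(axis=0) - mapped[y == 1].mean(axis=0))

def stable_ranking(X, y, n_runs=20):
    """Mean rank of each variable over SOMs trained from different random starts."""
    ranks = []
    for run in range(n_runs):
        score = class_contrast(train_som(X, seed=run), X, y)
        ranks.append(np.argsort(np.argsort(-score)) + 1)  # rank 1 = highest score
    return np.mean(ranks, axis=0)

X, y = simulate()
mean_rank = stable_ranking(X, y)
top = np.argsort(mean_rank)[:2]  # the two variables with the best average rank
print(sorted(top.tolist()))
```

Averaging ranks rather than raw scores keeps a single unusual run from dominating the selection, which is one simple way to read "consistently of high rank"; the paper itself may use a different criterion.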
