Subspace discriminant index to expedite exploration of multi-class omics data

Abstract: Omics datasets, which comprehensively characterize biological samples at the molecular level, are continuously increasing in both complexity and dimensionality. This growth creates a need for tools that improve data interpretability and expedite the extraction of relevant biochemical information. Here we introduce the subspace discriminant index (SDI) for multi-component models, which points to the most promising components for exploring pre-defined groups of observations and can also be used to compare several modeling variants in terms of discriminative power. The SDI is especially useful during the initial exploration of a data set, when informed decisions must be made on, e.g., pre-processing or modeling variants for further analysis. The versatility and efficiency of the proposed index are demonstrated in two real-world omics case studies, including a highly complex multi-class problem. The code for computing the SDI is freely available in the Matlab MEDA toolbox and linked in the present manuscript. By boosting interpretation capabilities, the SDI represents a significant addition to the chemometric toolbox.
