Critical comparison of methods for fault diagnosis in metabolomics data

Platforms such as metabolomics provide an unprecedented view of the chemical versatility of biomedical samples. Many diseases manifest as perturbations in specific combinations of metabolites. Multivariate analyses are essential to detect such combinations and associate them with specific diseases. Usually this is done by targeted discrimination of samples associated with a specific disease from non-diseased control samples. Such targeted data interpretation may not respect the heterogeneity of metabolic responses, both between diseases and within a disease. Here we show that multivariate methods that find any set of perturbed metabolites in a single patient may be combined with data collected with a single metabolomics technology to investigate a large array of diseases simultaneously. Several such untargeted data analysis approaches have already been proposed in other fields, e.g. Statistical Process Control, to find both expected and unexpected perturbations. We critically compared several of these approaches for their sensitivity and for their ability to correctly identify the specifically perturbed metabolites, and we introduce a new approach for this purpose. The newly introduced Sparse Mean approach, which we find to be the most sensitive and best able to identify the specifically perturbed metabolites, turns metabolomics into an untargeted diagnostic platform. Beyond metabolomics, the proposed approach may greatly benefit fault diagnosis with untargeted analyses in many other fields, such as industrial process control, food adulteration detection, and intrusion detection.
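
To make the idea of untargeted, single-patient diagnosis concrete, the following is a minimal sketch of a classical Statistical Process Control style check. It is not the Sparse Mean approach, whose details the abstract does not give. It assumes a matrix of control samples, scores one new sample with a Hotelling-type T² statistic, and flags individual metabolites by z-score. The function names, the ridge value, the z-score cutoff, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def fit_controls(X, ridge=1e-3):
    """Summarize control samples (rows = samples, columns = metabolites)."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    cov = np.cov(X, rowvar=False)
    # Assumed ridge term: keeps the covariance invertible when metabolites
    # are strongly correlated or outnumber the samples.
    prec = np.linalg.inv(cov + ridge * np.eye(X.shape[1]))
    return mu, sd, prec

def diagnose(x, mu, sd, prec, z_cut=4.0):
    """Return an overall T^2-type score and indices of flagged metabolites."""
    d = x - mu
    t2 = float(d @ prec @ d)                      # multivariate distance to controls
    flagged = np.flatnonzero(np.abs(d / sd) > z_cut)  # univariate contribution check
    return t2, flagged

# Simulated data (hypothetical): 200 controls, 10 metabolites.
rng = np.random.default_rng(0)
controls = rng.normal(size=(200, 10))
patient = rng.normal(size=10)
patient[[2, 7]] += 8.0                            # perturb two metabolites

mu, sd, prec = fit_controls(controls)
t2_patient, flagged = diagnose(patient, mu, sd, prec)
t2_healthy, _ = diagnose(controls[0], mu, sd, prec)
print(t2_patient, t2_healthy, flagged)
```

The check needs no disease label: any sufficiently large deviation from the control distribution is reported, together with the metabolites that drive it.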
