Missing values in mass spectrometry based metabolomics: an undervalued step in the data processing pipeline

Missing values occur widely in mass spectrometry metabolomic datasets and can arise for both technical and biological reasons. Currently, little is known about these data: how they are distributed across datasets, whether they need to be considered in the data processing pipeline and, most importantly, how best to assign them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We show that missing data are widespread, accounting for ca. 20% of the data and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We demonstrate that the choice of missing data estimation algorithm has a major effect on the outcome of data analysis when comparing biological sample groups, including by t test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute entries whose values were known but had been labelled as missing. Based on all of our findings, we identified the k-nearest neighbour imputation method (KNN) as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
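The assessment strategy described above (hide entries whose values are known, impute them, and compare the estimates with the truth) can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The function names knn_impute and masked_nrmse, the default k=5, the normalised root mean squared error as the score, and the choice of peaks rather than samples as neighbours are all assumptions made for illustration. Entries are also hidden uniformly at random here, which is a simplification given that real missing values depend on signal intensity and mass-to-charge ratio.

```python
import numpy as np


def knn_impute(X, k=5):
    """Fill NaNs in X (samples x peaks) from the k most similar peaks.

    Illustrative sketch only: similarity is the root mean squared difference
    over samples where both peaks were observed (an assumed choice).
    """
    P = np.asarray(X, dtype=float).T.copy()  # peaks x samples
    filled = P.copy()
    n_peaks = P.shape[0]
    for i in range(n_peaks):
        missing = np.isnan(P[i])
        if not missing.any():
            continue
        # Distance from peak i to every other peak over shared observations.
        dists = np.full(n_peaks, np.inf)
        for j in range(n_peaks):
            if j == i:
                continue
            shared = ~np.isnan(P[i]) & ~np.isnan(P[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((P[i, shared] - P[j, shared]) ** 2))
        order = np.argsort(dists)
        for s in np.where(missing)[0]:
            # Nearest peaks that were actually observed in sample s.
            neighbours = [j for j in order
                          if np.isfinite(dists[j]) and not np.isnan(P[j, s])][:k]
            if neighbours:
                filled[i, s] = np.mean(P[neighbours, s])
            else:
                # Assumed fallback: mean of the observed peaks in that sample.
                filled[i, s] = np.nanmean(P[:, s])
    return filled.T


def masked_nrmse(X_complete, impute, frac=0.2, seed=0):
    """Hide a fraction of known entries, impute them, and score the estimates."""
    X_complete = np.asarray(X_complete, dtype=float)
    rng = np.random.default_rng(seed)
    mask = rng.random(X_complete.shape) < frac  # random masking (simplification)
    X_missing = X_complete.copy()
    X_missing[mask] = np.nan
    X_hat = impute(X_missing)
    err = X_hat[mask] - X_complete[mask]
    return float(np.sqrt(np.mean(err ** 2) / np.var(X_complete[mask])))
```

Passing several candidate imputation functions to masked_nrmse on the same complete sub-matrix gives one score per method, with lower values indicating estimates closer to the hidden entries.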
