Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

Missing values are a genuine issue in label-free quantitative proteomics. Recent studies have surveyed the statistical methods available for imputation, compared them on real or simulated data sets, and recommended a list of missing value imputation methods for proteomics applications. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may differ in nature, and (ii) each imputation method is designed for a specific type of missingness mechanism. As a result, we believe that the question at stake is not which imputation method is the most accurate in general but which one is the most appropriate for the data at hand. We describe a series of comparisons that support this view. For instance, we show that a supposedly "under-performing" method (i.e., one giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state of the art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.
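The central claim, that the ranking of imputation methods depends on the missingness mechanism, can be illustrated with a small simulation. The sketch below is not the authors' benchmark: the Gaussian log-abundances, the two mechanisms (values removed uniformly at random versus the lowest values removed, mimicking a detection limit), the 30% missing rate, and the two deliberately simple imputations (observed mean, observed minimum) are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def simulate_abundances(n, rng):
    # Hypothetical log-intensities; the Gaussian shape is an assumption.
    return [rng.gauss(20.0, 2.0) for _ in range(n)]


def random_missing(values, rate, rng):
    # Missingness unrelated to abundance: drop indices uniformly at random.
    k = int(len(values) * rate)
    return set(rng.sample(range(len(values)), k))


def censored_missing(values, rate):
    # Missingness driven by abundance: drop the lowest values,
    # as a crude stand-in for a detection limit.
    k = int(len(values) * rate)
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:k])


def rmse_of_constant_fill(values, missing, fill):
    # Error of replacing every missing entry with the same constant.
    errors = [(values[i] - fill) ** 2 for i in missing]
    return math.sqrt(sum(errors) / len(errors))


def compare(n=1000, rate=0.3, seed=0):
    rng = random.Random(seed)
    values = simulate_abundances(n, rng)
    masks = {
        "random": random_missing(values, rate, rng),
        "censored": censored_missing(values, rate),
    }
    results = {}
    for name, missing in masks.items():
        observed = [v for i, v in enumerate(values) if i not in missing]
        results[name] = {
            # Mean imputation: fill with the average of observed values.
            "mean": rmse_of_constant_fill(
                values, missing, sum(observed) / len(observed)
            ),
            # Minimum imputation: fill with the smallest observed value.
            "min": rmse_of_constant_fill(values, missing, min(observed)),
        }
    return results


if __name__ == "__main__":
    for mechanism, scores in compare().items():
        print(mechanism, {k: round(v, 2) for k, v in scores.items()})
```

Under these assumptions, mean imputation yields the lower error when values are removed at random, while minimum imputation yields the lower error when the lowest values are censored, so neither method is "better" independently of the mechanism.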
