RobNorm: Model-Based Robust Normalization Method for Labeled Quantitative Mass Spectrometry Proteomics Data

Motivation: Data normalization, an important step in processing proteomics data generated in mass spectrometry (MS) experiments, aims to reduce sample-level variation and facilitate comparisons between samples. Previously published normalization methods primarily depend on the assumption that the distribution of protein expression is similar across all samples. However, this assumption fails when the protein expression data are generated from heterogeneous samples, such as samples from various tissue types. This led us to develop a novel data-driven normalization method that corrects the systematic bias while maintaining the underlying biological heterogeneity.

Methods: To robustly correct the systematic bias, we used the density-power-weight method to down-weight outliers and extended the one-dimensional robust fitting method described in previous work (Windham, 1995; Fujisawa and Eguchi, 2008) to our structured data. We then constructed a robustness criterion and developed a new normalization algorithm, called RobNorm.

Results: In simulation studies and in an analysis of real data from the Genotype-Tissue Expression (GTEx) project, we compared and evaluated the performance of RobNorm against other normalization methods. We found that the RobNorm approach exhibits the greatest reduction in systematic bias while maintaining across-tissue variation, especially for datasets from highly heterogeneous samples.

Availability: https://github.com/mwgrassgreen/RobNorm

Contact: huatang@stanford.edu and mpsnyder@stanford.edu
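As an illustration of the density-power-weight idea mentioned in the Methods paragraph, the following is a minimal sketch of a one-dimensional robust Gaussian fit in the spirit of Windham (1995) and Fujisawa and Eguchi (2008). It is not the RobNorm algorithm itself, which extends this idea to structured protein-by-sample data. The function name `robust_gaussian_fit`, the exponent `gamma = 0.5`, the median/MAD starting values, and the stopping rule are all assumptions made for this example.

```python
import math
import random
import statistics


def robust_gaussian_fit(x, gamma=0.5, n_iter=200, tol=1e-10):
    """Fit a 1-D Gaussian with density-power weights (illustrative sketch).

    Each point is weighted by its current Gaussian density raised to the
    power `gamma`, so points far from the bulk get weights near zero.
    """
    # Robust starting values: median and a MAD-based variance (assumption).
    mu = statistics.median(x)
    mad = statistics.median(abs(v - mu) for v in x)
    var = max((1.4826 * mad) ** 2, 1e-12)

    for _ in range(n_iter):
        # Weights proportional to density**gamma; the normalizing
        # constant of the density cancels in the weighted averages.
        w = [math.exp(-gamma * (v - mu) ** 2 / (2.0 * var)) for v in x]
        total = sum(w)
        new_mu = sum(wi * v for wi, v in zip(w, x)) / total
        # The (1 + gamma) factor corrects the shrinkage that the
        # weighting induces in the variance estimate.
        new_var = (1.0 + gamma) * sum(
            wi * (v - new_mu) ** 2 for wi, v in zip(w, x)
        ) / total
        new_var = max(new_var, 1e-12)

        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    return mu, var


# Usage: 200 inliers from N(0, 1) plus 20 outliers placed at 10.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [10.0] * 20
mu_hat, var_hat = robust_gaussian_fit(data)
print("plain mean:", sum(data) / len(data))
print("robust mean and variance:", mu_hat, var_hat)
```

In this toy setting the plain mean is pulled toward the outliers, whereas the weighted estimate stays close to the center of the inlier population. Larger values of `gamma` down-weight outliers more aggressively at some cost in efficiency when the data are clean.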

[1]  Terry M. Therneau, et al.  Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA, 2008, Journal of proteome research.

[2]  H. Senn, et al.  Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics, 2006, Analytical chemistry.

[3]  B. Ripley, et al.  Robust Statistics, 2018, Encyclopedia of Mathematical Geosciences.

[4]  M. P. Windham.  Robustifying Model Fitting, 1995.

[5]  Joshua N. Adkins, et al.  Normalization of peak intensities in bottom-up MS-based proteomics using singular value decomposition, 2009, Bioinform.

[6]  V. Yohai, et al.  Robust Statistics: Theory and Methods, 2006.

[7]  Per E. Andrén, et al.  Development and Evaluation of Normalization Methods for Label-free Relative Quantification of Endogenous Peptides, 2009, Molecular & Cellular Proteomics.

[8]  R Core Team, et al.  R: A language and environment for statistical computing, 2014.

[9]  John D. Storey, et al.  Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis, 2007, PLoS genetics.

[10]  T. Therneau, et al.  A statistical model for iTRAQ data analysis, 2008, Journal of proteome research.

[11]  Stephen J. Callister, et al.  Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics, 2006, Journal of proteome research.

[12]  Y. Benjamini, et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.

[13]  Christopher D. Brown, et al.  A Quantitative Proteome Map of the Human Body, 2019, Cell.

[14]  Damian Szklarczyk, et al.  STRING v9.1: protein-protein interaction networks, with increased coverage and integration, 2012, Nucleic Acids Res.

[15]  Laura L. Elo, et al.  A systematic evaluation of normalization methods in quantitative label-free proteomics, 2016, Briefings Bioinform.

[16]  Lily Ting, et al.  Normalization and Statistical Analysis of Quantitative Proteomics Data Generated by Metabolic Labeling, 2009, Molecular & Cellular Proteomics.

[17]  Richard D. Smith, et al.  Normalization and missing value imputation for label-free LC-MS analysis, 2012, BMC Bioinformatics.

[18]  M. C. Jones, et al.  Robust and efficient estimation by minimising a density power divergence, 1998.

[19]  Marco Y. Hein, et al.  Accurate Proteome-wide Label-free Quantification by Delayed Normalization and Maximal Peptide Ratio Extraction, Termed MaxLFQ, 2014, Molecular & Cellular Proteomics.

[20]  S. Dudoit, et al.  Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, 2002.

[21]  Stefan Tenzer, et al.  In-depth evaluation of software tools for data-independent acquisition based label-free quantification, 2015, Proteomics.

[22]  Matthew E. Ritchie, et al.  limma powers differential expression analyses for RNA-sequencing and microarray studies, 2015, Nucleic acids research.

[23]  Terry M. Therneau, et al.  Faster cyclic loess: normalizing RNA arrays via linear models, 2004, Bioinform.

[24]  Fredrik Levander, et al.  Normalyzer: A Tool for Rapid Evaluation of Normalization Methods for Omics Data Sets, 2014, Journal of proteome research.

[25]  Ann L. Oberg, et al.  Statistical methods for quantitative mass spectrometry proteomic experiments with labeling, 2012, BMC Bioinformatics.

[26]  Terence P. Speed, et al.  A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, 2003, Bioinform.

[27]  HighWire Press, et al.  Molecular & cellular proteomics, 2002.

[28]  Martin Vingron, et al.  Variance stabilization applied to microarray data calibration and to the quantification of differential expression, 2002, ISMB.

[29]  D. Ruppert.  Robust Statistics: The Approach Based on Influence Functions, 1987.

[30]  S. Eguchi, et al.  Robust parameter estimation with a small bias against heavy contamination, 2008.