Tail-Robust Quantile Normalization

High-throughput biological data, such as mass spectrometry-based proteomics data, suffer from systematic non-biological variance introduced by sources such as batch effects. This variance hinders the estimation of the ‘real’ biological signal and thus decreases the power of statistical tests and biases the identification of differential expression between sample classes. To remove such unintended variation while retaining the biological signal of interest, analysis workflows for mass spectrometry-based quantification typically comprise normalization steps prior to the statistical analysis of the data. Several normalization methods, such as quantile normalization, were originally developed for microarray data. Unlike microarray data, however, proteomics data may contain features, in the form of protein intensities, that are consistently highly abundant across experimental conditions and are therefore found in the tails of the protein intensity distribution. If such proteins are present, statistical inference on the intensity profiles of the normalized features is impeded: the variance of the data is estimated with bias, which increases the number of false-positive findings. We therefore developed a novel, freely available approach, ‘tail-robust quantile normalization’. It extends traditional quantile normalization to preserve the biological signals of features in the tails of the distribution across experimental conditions and to account for sample-dependent missing values.
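To illustrate why features in the tails are a problem, the following is a minimal sketch of traditional quantile normalization, not of the tail-robust extension itself. It assumes a complete features-by-samples matrix without missing values; the function name `quantile_normalize` and the toy intensities are hypothetical and chosen only for illustration.

```python
import numpy as np

def quantile_normalize(intensities):
    """Traditional quantile normalization of a features-by-samples matrix.

    Assumes no missing values. Ties are resolved by stable sort order
    rather than by averaging tied ranks.
    """
    x = np.asarray(intensities, dtype=float)
    # Row indices that sort each sample (column) in ascending order.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_x.mean(axis=1)
    # Write the reference value for each rank back to its original position.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Toy log-intensities (made-up numbers): rows are proteins, columns are samples.
x = np.array([
    [30.0, 31.5, 29.0],  # hypothetical protein that is the most abundant in every sample
    [22.0, 21.0, 23.5],
    [20.5, 22.5, 21.0],
    [18.0, 19.0, 17.5],
])
normalized = quantile_normalize(x)
print(normalized)
```

In this toy example the first protein holds the top rank in every sample, so it receives the same reference value in every column and its variation across samples disappears entirely. This is the kind of distortion in the tails that the abstract describes for the traditional method.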
