The necessity of adjusting tests of protein category enrichment in discovery proteomics

MOTIVATION Enrichment tests are used in high-throughput experimentation to measure the association between gene or protein expression and membership in groups or pathways. Fisher's exact test is commonly used for this purpose. We examined the associations that the Fisher test produces between protein identification by mass spectrometry discovery proteomics and the Gene Ontology (GO) term assignments of those proteins in a large yeast dataset. We found that direct application of the Fisher test is misleading in proteomics because mass spectrometry preferentially identifies proteins based on their biochemical properties. If this bias is not corrected, false inferences about associations can be made. Our method adjusts Fisher tests for this bias and produces associations that are more directly attributable to protein expression than to experimental bias.

RESULTS Using logistic regression, we modeled the association between protein identification and GO term assignments while adjusting for identification bias in mass spectrometry. The model accounts for five biochemical properties of peptides: (i) hydrophobicity, (ii) molecular weight, (iii) transfer energy, (iv) beta turn frequency and (v) isoelectric point. The model was fit on 181 060 peptides from 2678 proteins identified in 24 yeast proteomics datasets with a 1% false discovery rate. In analyzing the association between protein identification and GO term assignments, we found that 25% (134 out of 544) of the Fisher tests that showed significant association (q-value ≤0.05) were non-significant after adjustment using our model. Simulations generating yeast protein sets enriched for identification propensity show that unadjusted enrichment tests were biased, while our approach worked well.
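For orientation, the unadjusted test discussed above can be sketched as a one-sided Fisher's exact test on a 2x2 table of GO term membership versus identification. This is a minimal illustration, not the authors' implementation: the function name `fisher_enrichment_p`, the proteome size of 6000 and the per-term counts are hypothetical, and only the figure of 2678 identified proteins comes from the text. The sketch does not include the logistic regression adjustment for peptide properties.

```python
from math import comb


def fisher_enrichment_p(k, n_term, n_identified, n_total):
    """One-sided Fisher's exact test p-value for enrichment.

    k            -- proteins that are both identified and annotated to the term
    n_term       -- proteins annotated to the GO term
    n_identified -- proteins identified by mass spectrometry
    n_total      -- proteins in the background proteome

    Returns P(X >= k), where X follows the hypergeometric distribution
    implied by fixing the margins of the 2x2 table.
    """
    upper = min(n_term, n_identified)
    denom = comb(n_total, n_identified)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(
        comb(n_term, i) * comb(n_total - n_term, n_identified - i)
        for i in range(k, upper + 1)
    )
    return tail / denom


# Hypothetical example: a term with 100 annotated proteins, 60 of them
# identified, against an assumed background of 6000 proteins.
p_value = fisher_enrichment_p(k=60, n_term=100, n_identified=2678, n_total=6000)
print(f"unadjusted one-sided p-value: {p_value:.3g}")
```

A small p-value here says only that annotated proteins are over-represented among identified ones. It cannot tell whether that comes from expression or from the term's proteins being easier to detect, which is the gap the adjusted model is meant to close.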
