Identifier mapping performance for integrating transcriptomics and proteomics experimental results

Background
Studies integrating transcriptomic data with proteomic data can illuminate the proteome more clearly than either data type alone. Integromic studies can deepen understanding of the dynamic, complex regulatory relationship between the transcriptome and the proteome. Integrating these data requires a reliable mapping between the identifier nomenclatures produced by the two high-throughput platforms. However, this kind of analysis is well known to be hampered by the lack of standardization of identifier nomenclature among proteins, genes, and microarray probe sets. Data integration may therefore also help to critique the fallible gene identifications that both platforms emit.

Results
We compared three freely available internet-based identifier mapping resources for mapping UniProt accessions (ACCs) to Affymetrix probeset identifiers (IDs): DAVID, EnVision, and NetAffx. Liquid chromatography-tandem mass spectrometry analyses of 91 endometrial cancer and 7 noncancer samples generated 11,879 distinct ACCs. For each ACC, we compared the sets of probeset IDs retrieved from each mapping resource, and confirmed a high level of discrepancy among the resources. mRNA expression data were available for the same samples. To evaluate the quality of each ACC-to-probeset match, we therefore calculated proteome-transcriptome correlations and compared the resources, presuming that better identifier mapping should generate a higher proportion of mapped pairs with strong inter-platform correlations. A mixture model for the correlations fitted well and supported regression analysis, providing a window into the performance of the mapping resources. The resources have added and dropped matches over two years, but their overall performance has not changed.

Conclusions
The methods presented here provide concrete, context-specific insight that supports well-informed decisions in choosing an ID mapping strategy for "omic" data merging.
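The comparison described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. The accession and probeset names, the toy abundance values, the use of Jaccard overlap to quantify discrepancy, the choice of Pearson correlation, and the 0.5 cutoff for a "strong" correlation are all assumptions made for illustration. The mixture model and regression analysis mentioned above are not reproduced here.

```python
from math import sqrt

# Toy retrieval sets (hypothetical IDs): resource -> accession -> probeset IDs.
mappings = {
    "DAVID":    {"ACC_A": {"ps1", "ps2"}},
    "EnVision": {"ACC_A": {"ps1"}},
    "NetAffx":  {"ACC_A": {"ps1", "ps2", "ps3"}},
}

# Toy per-sample abundances (hypothetical values), same sample order on both platforms.
protein = {"ACC_A": [1.0, 2.0, 3.0, 4.0]}
mrna = {
    "ps1": [2.0, 4.0, 6.0, 8.0],
    "ps2": [4.0, 3.0, 2.0, 1.0],
    "ps3": [1.0, 3.0, 2.0, 4.0],
}


def jaccard(a, b):
    """Overlap of two retrieval sets; 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(mappings, acc):
    """Jaccard overlap of the probeset sets for one accession, per resource pair."""
    names = sorted(mappings)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            set_a = mappings[first].get(acc, set())
            set_b = mappings[second].get(acc, set())
            result[(first, second)] = jaccard(set_a, set_b)
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def fraction_strong(resource_map, protein, mrna, threshold=0.5):
    """Share of mapped (accession, probeset) pairs with correlation >= threshold."""
    correlations = [
        pearson(protein[acc], mrna[ps])
        for acc, probesets in resource_map.items()
        for ps in sorted(probesets)
        if acc in protein and ps in mrna
    ]
    if not correlations:
        return float("nan")
    return sum(r >= threshold for r in correlations) / len(correlations)


agreement = pairwise_agreement(mappings, "ACC_A")
strong = {name: fraction_strong(m, protein, mrna) for name, m in mappings.items()}
```

Under the paper's stated presumption, a resource whose mapped pairs show a higher share of strong correlations would be judged the better mapper for that dataset. In this toy example a resource that returns fewer but better-correlated probesets scores higher than one that returns more.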
