Protein identification using customized protein sequence databases derived from RNA-Seq data.

The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In this study, we have developed a two-step strategy for building sample-specific protein databases from RNA-Seq data. First, the database size is reduced by eliminating unexpressed or lowly expressed genes according to transcript quantification. Second, high-quality nonsynonymous coding single nucleotide variations (SNVs) are identified based on RNA-Seq data, and the corresponding protein variants are added to the database. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines, SW480 and RKO, we demonstrated that customized protein sequence databases could significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants. Thus, sample-specific databases from RNA-Seq data can enable more sensitive and comprehensive protein discovery in shotgun proteomics studies.
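The two-step strategy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the expression measure, the cutoff, or how variant entries are named, so the `min_expression` threshold, the `Snv` record, and the `build_custom_database` function are all hypothetical names and values chosen for illustration. The sketch also assumes that SNVs have already been called, filtered for quality, and translated to single amino acid substitutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snv:
    """Hypothetical record for one nonsynonymous coding SNV at the protein level."""
    protein_id: str
    position: int  # 1-based residue position in the reference protein
    ref: str       # reference amino acid
    alt: str       # variant amino acid


def build_custom_database(proteins, expression, snvs, min_expression=1.0):
    """Return a sample-specific {id: sequence} database.

    proteins:       reference protein sequences keyed by protein id
    expression:     transcript-level expression estimate per protein id
                    (units and cutoff are assumptions, not from the paper)
    snvs:           iterable of Snv records
    min_expression: illustrative cutoff for "expressed"
    """
    # Step 1: drop proteins whose transcripts are unexpressed or lowly expressed.
    kept = {
        pid: seq
        for pid, seq in proteins.items()
        if expression.get(pid, 0.0) >= min_expression
    }

    # Step 2: add one variant entry per SNV that hits a retained protein.
    custom = dict(kept)
    for snv in snvs:
        seq = kept.get(snv.protein_id)
        if seq is None:
            continue  # the protein was filtered out in step 1
        idx = snv.position - 1
        if not 0 <= idx < len(seq) or seq[idx] != snv.ref:
            continue  # skip SNVs that disagree with the reference sequence
        variant_id = f"{snv.protein_id}_{snv.ref}{snv.position}{snv.alt}"
        custom[variant_id] = seq[:idx] + snv.alt + seq[idx + 1:]
    return custom
```

Keeping each variant as a separate entry alongside the reference sequence lets a search engine match either form of the peptide. Whether the original workflow stores full-length variant proteins or only variant peptides is not stated in the abstract.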
