MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms

Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript level. In proteogenomics, these multi-omics data are combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or on reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translation artificially increases the size of the target database sixfold, and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq driven transcript databases. MSProGene is independent of existing reference databases and annotated SNPs, and it avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm, which also resolves the ambiguity of shared peptides in protein inference. We applied MSProGene to three datasets and show that it enables database-independent, reliable and accurate predictions at the gene and protein level and additionally identifies novel genes.

Availability and implementation: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/.

Contact: renardb@rki.de
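To make the sixfold database growth mentioned in the summary concrete, the following minimal sketch enumerates the six reading frames of a nucleotide sequence (three offsets on each strand) and translates each with the standard genetic code. It is an illustration only and is not taken from MSProGene; the function names (`reverse_complement`, `translate`, `six_frame_translations`) and the toy input sequence are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translations(seq):
    """Hypothetical helper: map frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence chosen for illustration; '*' marks a stop codon.
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Every input sequence yields six candidate translations, of which at most one corresponds to the expressed protein, so a search database built this way is dominated by sequences that only add to the search space.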
