Paper information - MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms

Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript levels. In proteogenomics, these multi-omics data are combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or on reference protein databases extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translation introduces an artificial sixfold increase of the target database, and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq-driven transcript databases. MSProGene is independent of existing reference databases and annotated SNPs, and it avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm, which also allows resolving the ambiguity of shared peptides in protein inference. We applied MSProGene to three datasets and show that it facilitates reliable and accurate database-independent prediction at the gene and protein levels and additionally identifies novel genes.

Availability and implementation: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/.

Contact: renardb@rki.de

Bernhard Y. Renard | Franziska Zickmann
