Proteogenomics: From next-generation sequencing (NGS) and mass spectrometry-based proteomics to precision medicine.

One of the best-established areas within multi-omics is proteogenomics, whose underpinning technologies are next-generation sequencing (NGS) and mass spectrometry (MS). Proteogenomics has contributed significantly to genome (re-)annotation, in which novel coding sequences (CDS) are identified and confirmed. By incorporating in silico translated genome variants into protein databases, single amino acid variants (SAAVs) and splice proteoforms can be identified and quantified at the peptide level. The application of proteogenomics in cancer research potentially enables the identification of patient-specific proteoforms, as well as the association of the efficacy of, or resistance to, cancer therapy with different mutations. Here, we discuss how NGS and third-generation sequencing (TGS) data are analyzed and incorporated into the proteogenomic framework. These sequence data mainly originate from whole genome sequencing (WGS), whole exome sequencing (WES) and RNA-Seq. We explain two major strategies for sequence analysis, i.e., de novo assembly and read mapping, followed by the construction of customized protein databases from such data. We also elaborate on the procedures for matching spectra to peptide sequences in proteogenomics, and on the relationship between database size and the false discovery rate (FDR). Finally, we discuss the latest developments in proteogenomics-assisted precision oncology, as well as challenges and opportunities in proteogenomics research.
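As an illustration of how an in silico translated variant can enter a customized protein database, the following minimal Python sketch applies a single amino acid substitution to a protein sequence and keeps only the tryptic peptides that differ from the reference digest. It is a sketch under stated assumptions, not the workflow of any particular tool: the function names, the example sequence, the A4V substitution, and the simplified cleavage rule (after K or R unless followed by P, no missed cleavages, minimum length of 6) are all hypothetical choices made for the example.

```python
def tryptic_digest(protein, min_len=6):
    """Simplified trypsin rule: cleave after K or R unless the next residue is P.

    No missed cleavages are generated; peptides shorter than min_len are dropped.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if len(p) >= min_len]


def apply_saav(protein, pos, ref, alt):
    """Return the protein with residue `ref` at 1-based position `pos` replaced by `alt`."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


def variant_peptides(protein, pos, ref, alt):
    """Peptides present in the variant digest but absent from the reference digest."""
    reference = set(tryptic_digest(protein))
    variant = set(tryptic_digest(apply_saav(protein, pos, ref, alt)))
    return sorted(variant - reference)


# Hypothetical reference sequence and substitution, for illustration only.
reference_protein = "MKTAYIAKQRQISFVKSHFSR"
print(variant_peptides(reference_protein, 4, "A", "V"))
```

Only the variant-specific peptides would be appended to the search database in this sketch; keeping the added entries few is one simple way to limit database growth, which matters because database size influences the FDR.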

[52]  D. Frishman,et al.  Annotation of the Domestic Pig Genome by Quantitative Proteogenomics. , 2017, Journal of proteome research.

[53]  Ruedi Aebersold,et al.  Applications and Developments in Targeted Proteomics: From SRM to DIA/SWATH , 2016, Proteomics.

[54]  Steven Piantadosi,et al.  Wearable activity monitors in oncology trials: Current use of an emerging technology. , 2018, Contemporary clinical trials.

[55]  P. Boutros,et al.  Onco-proteogenomics: cancer proteomics joins forces with genomics , 2014, Nature Methods.

[56]  Xun Xu,et al.  PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq , 2016, BMC Bioinformatics.

[57]  S. Gabriel,et al.  Analysis of 6,515 exomes reveals a recent origin of most human protein-coding variants , 2012, Nature.

[58]  P. Pevzner,et al.  Spectral Profiles, a Novel Representation of Tandem Mass Spectra and Their Applications for De Novo Peptide Sequencing and Identification* □ S , 2022 .

[59]  Lovelace J. Luquette,et al.  A Pan-Cancer Proteogenomic Atlas of PI3K/AKT/mTOR Pathway Alterations. , 2017, Cancer cell.

[60]  Ludovic C. Gillet,et al.  Targeted Data Extraction of the MS/MS Spectra Generated by Data-independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis* , 2012, Molecular & Cellular Proteomics.

[61]  P. Bork,et al.  A method and server for predicting damaging missense mutations , 2010, Nature Methods.

[62]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[63]  Michael L. Gatza,et al.  Proteogenomics connects somatic mutations to signaling in breast cancer , 2016, Nature.

[64]  R. Pazdur,et al.  Next-Generation Sequencing in Oncology in the Era of Precision Medicine. , 2016, JAMA oncology.

[65]  D. Haussler,et al.  Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. , 2005, Genome research.

[66]  Srikanth S. Manda,et al.  Identification and characterization of proteins encoded by chromosome 12 as part of chromosome-centric human proteome project. , 2014, Journal of proteome research.

[67]  S. V. Heesch,et al.  University of Groningen Quantitative and Qualitative Proteome Characteristics Extracted from In-Depth Integrated Genomics and Proteomics Analysis , 2018 .

[68]  William Stafford Noble,et al.  Semi-supervised learning for peptide identification from shotgun proteomics datasets , 2007, Nature Methods.

[69]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[70]  Jacob D. Jaffe,et al.  Proteogenomics: Opportunities and Caveats. , 2016, Clinical chemistry.

[71]  C. Burge,et al.  Prediction of Mammalian MicroRNA Targets , 2003, Cell.

[72]  Steven P Gygi,et al.  Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry , 2007, Nature Methods.

[73]  A. Heck,et al.  Six alternative proteases for mass spectrometry–based proteomics beyond trypsin , 2016, Nature Protocols.

[74]  Jeffrey R. Whiteaker,et al.  Proteogenomic characterization of human colon and rectal cancer , 2014, Nature.

[75]  Ekaterina Mostovenko,et al.  Comparison of peptide and protein fractionation methods in proteomics , 2013 .

[76]  Lennart Martens,et al.  Database Search Engines: Paradigms, Challenges and Solutions. , 2016, Advances in experimental medicine and biology.

[77]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[78]  Kyu-Baek Hwang,et al.  Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification , 2016, BMC Genomics.

[79]  Colin N. Dewey,et al.  De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis , 2013, Nature Protocols.

[80]  R. Guigó,et al.  An assessment of gene prediction accuracy in large DNA sequences. , 2000, Genome research.

[81]  T. Kondo Proteogenomics for the Study of Gastrointestinal Stromal Tumors. , 2016, Advances in experimental medicine and biology.

[82]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[83]  B. Shen,et al.  A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites , 2014, Proteomics.

[84]  R. Kurzrock,et al.  Debunking the Delusion That Precision Oncology Is an Illusion. , 2017, The oncologist.

[85]  Hal Hodson,et al.  Google DeepMind and healthcare in an age of algorithms , 2017, Health and Technology.

[86]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[87]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.

[88]  Samuel H. Payne,et al.  Proteogenomic strategies for identification of aberrant cancer peptides using large‐scale next‐generation sequencing data , 2014, Proteomics.

[89]  James E. Johnson,et al.  Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations , 2014, BMC Genomics.

[90]  Chao Xie,et al.  Fast and sensitive protein alignment using DIAMOND , 2014, Nature Methods.

[91]  T. Nishimura,et al.  Developments for Personalized Medicine of Lung Cancer Subtypes: Mass Spectrometry-Based Clinical Proteogenomic Analysis of Oncogenic Mutations. , 2016, Advances in experimental medicine and biology.

[92]  A. Nesvizhskii Proteogenomics: concepts, applications and computational strategies , 2014, Nature Methods.

[93]  Gerben Menschaert,et al.  Identification of Small Novel Coding Sequences, a Proteogenomics Endeavor. , 2016, Advances in experimental medicine and biology.

[94]  Masaru Tomita,et al.  Onco-proteogenomics: a novel approach to identify cancer-specific mutations combining proteomics and transcriptome deep sequencing , 2010, Genome Biology.

[95]  M. MacCoss,et al.  Maximizing Peptide Identification Events in Proteomic Workflows Using Data-Dependent Acquisition (DDA)* , 2013, Molecular & Cellular Proteomics.

[96]  B. Rood,et al.  A Proteogenomic Approach to Understanding MYC Function in Metastatic Medulloblastoma Tumors , 2016, International journal of molecular sciences.

[97]  S. Sauer,et al.  Efficient Application of De Novo RNA Assemblers for Proteomics Informed by Transcriptomics. , 2016, Journal of proteome research.

[98]  Alexey I Nesvizhskii,et al.  Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. , 2002, Analytical chemistry.

[99]  C. Ferreira,et al.  Unveiling alterative splice diversity from human oligodendrocyte proteome data. , 2017, Journal of proteomics.

[100]  Young-Jin Choi,et al.  Proteogenomic Study beyond Chromosome 9: New Insight into Expressed Variant Proteome and Transcriptome in Human Lung Adenocarcinoma Tissues. , 2015, Journal of proteome research.

[101]  Mehdi Mesri,et al.  Linking cancer genome to proteome: NCI's investment into proteogenomics , 2014, Proteomics.

[102]  Henry Rodriguez,et al.  Revolutionizing Precision Oncology through Collaborative Proteogenomics and Data Sharing , 2018, Cell.

[103]  Paul C. Boutros,et al.  Detecting protein variants by mass spectrometry: a comprehensive study in cancer cell-lines , 2017, Genome Medicine.

[104]  John D. Venable,et al.  Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra , 2004, Nature Methods.

[105]  Joel A. Kooren,et al.  A two‐step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies , 2013, Proteomics.

[106]  S. Lemieux,et al.  Proteogenomic-based discovery of minor histocompatibility antigens with suitable features for immunotherapy of hematologic cancers , 2016, Leukemia.