CusVarDB: A tool for building customized sample-specific variant protein database from next-generation sequencing datasets

Cancer genome sequencing studies have revealed numerous variants in the coding regions of many genes. Some of these coding variants play an important role in activating specific pathways that drive proliferation. Coding variants presented on cancer cell surfaces by the major histocompatibility complex serve as neo-antigens and result in immune activation. The success of immune therapy in patients is attributed to the neo-antigen load on cancer cell surfaces. However, genomic data alone cannot predict which coding variants are expressed at the protein level. Complementing genomic data with proteomic data can potentially reveal the coding variants that are expressed as proteins. Identifying variant peptides from mass spectrometry data nevertheless remains challenging because no appropriate tool integrates the genomic and proteomic data analysis pipelines. To overcome this problem, and to make the analysis accessible to biologists, we have developed a graphical user interface (GUI)-based tool called CusVarDB. The tool integrates a variant calling pipeline to generate a sample-specific variant protein database from next-generation sequencing datasets. We validated the tool with triple negative breast cancer cell line datasets and identified 423, 408, 386 and 361 variant peptides from the BT474, MDAMB157, MFM223 and HCC38 datasets, respectively.
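The core idea of a sample-specific variant protein database can be illustrated with a minimal sketch. This is not CusVarDB's implementation: it assumes that variant calling and annotation have already reduced each coding variant to a single amino acid substitution, and the names `Substitution`, `apply_substitution`, `variant_fasta` and the accession `P_DEMO` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Substitution(NamedTuple):
    """One amino acid substitution (hypothetical record layout)."""
    protein_id: str  # accession of the reference protein
    ref: str         # reference amino acid, one-letter code
    pos: int         # 1-based position in the protein sequence
    alt: str         # variant amino acid, one-letter code


def apply_substitution(seq: str, sub: Substitution) -> str:
    """Return a copy of seq with the substitution applied."""
    idx = sub.pos - 1
    if not 0 <= idx < len(seq):
        raise ValueError(f"position {sub.pos} is outside the sequence")
    if seq[idx] != sub.ref:
        # A mismatch usually means the annotation and the reference
        # database come from different releases.
        raise ValueError(
            f"expected {sub.ref} at position {sub.pos}, found {seq[idx]}"
        )
    return seq[:idx] + sub.alt + seq[idx + 1:]


def variant_fasta(proteins: Dict[str, str],
                  subs: Iterable[Substitution]) -> str:
    """Build FASTA text with one variant protein entry per substitution."""
    records = []
    for sub in subs:
        mutated = apply_substitution(proteins[sub.protein_id], sub)
        header = f">{sub.protein_id}_{sub.ref}{sub.pos}{sub.alt}"
        records.append(f"{header}\n{mutated}\n")
    return "".join(records)


if __name__ == "__main__":
    reference = {"P_DEMO": "MKTAYIAKQR"}  # toy sequence, not real data
    calls = [Substitution("P_DEMO", "A", 4, "V")]
    print(variant_fasta(reference, calls), end="")
```

The resulting FASTA text could then be supplied to a proteomic search engine alongside the reference proteome, so that spectra matching variant peptides can be identified. Checking the reference residue before substituting is a deliberate design choice in this sketch, since it catches mismatches between the annotation and the protein database early.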
