Identification of HPV integration and gene mutation in HeLa cell line by integrated analysis of RNA-Seq and MS/MS data.

The HeLa cell line, derived from a cervical carcinoma, provides an ideal platform for studying both the integration of human papillomavirus (HPV) and the massive mutations occurring in the cancer cell genome. Proteogenomics, a field at the intersection of proteomics and genomics, is used to perform gene annotation and identify gene mutations. In this work, we first identified SNVs/INDELs, structural variations (SVs), and virus infection/integration events from RNA-Seq data of the HeLa cell line; then, by applying a proteogenomics strategy, we were able to detect some of these genomic events in tandem mass spectrometry (MS/MS) data from the same sample. Furthermore, some of the mutated peptides were experimentally validated using multiple reaction monitoring (MRM) technology. The integrated analysis of the RNA-Seq and MS/MS data not only renders the discovery of HeLa cell genome variations more credible but also illustrates a practical workflow for protein-coding mutation discovery in cancer-related studies.
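
To make the proteogenomics step concrete, the following is a minimal sketch of how a variant called from RNA-Seq could be turned into candidate peptides for an MS/MS database search. It is not the pipeline used in this work. The function names, the toy protein sequence, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions chosen for illustration.

```python
def tryptic_digest(protein, min_len=6):
    """In silico trypsin digest: cleave after K/R unless the next residue is P.

    Simplified assumption: no missed cleavages; peptides shorter than
    min_len are dropped because they are rarely identifiable by MS/MS.
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return [p for p in peptides if len(p) >= min_len]


def apply_substitution(protein, pos, ref, alt):
    """Apply a single amino acid substitution; pos is 1-based."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


def variant_peptides(protein, pos, ref, alt, min_len=6):
    """Return peptides that exist only in the mutated protein.

    These are the entries one would append to a reference protein
    database so that a search engine can match mutated spectra.
    """
    reference = set(tryptic_digest(protein, min_len))
    mutated = tryptic_digest(apply_substitution(protein, pos, ref, alt), min_len)
    return [p for p in mutated if p not in reference]


if __name__ == "__main__":
    toy_protein = "MSTAGKLLEDVRPEPTIDEKAAGGSR"  # made-up sequence, not a HeLa protein
    print(variant_peptides(toy_protein, 3, "T", "A"))
```

Restricting the added entries to peptides absent from the reference digest keeps the customized database small, which helps limit the inflation of the search space when controlling false discoveries.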
