Mut2Vec: distributed representation of cancerous mutations

Background: Embedding techniques that convert high-dimensional sparse data into low-dimensional distributed representations have been gaining popularity in various fields of research. In deep learning models, embeddings are commonly used and have proven more effective than naive binary representations. However, no attempt has yet been made to embed highly sparse mutation profiles into dense distributed representations. Because binary representation does not capture biological context, its use is limited in many applications, such as discovering novel driver mutations. Additionally, training distributed representations of mutations is challenging because relatively little biological data is available compared with the large text corpora used in text mining.

Methods: We introduce Mut2Vec, a novel computational pipeline for creating distributed representations of cancerous mutations. Mut2Vec is trained on cancer profiles using Skip-Gram, since cancer can be characterized by a series of co-occurring mutations. We also augmented our pipeline with existing information from the biomedical literature and protein-protein interaction networks to compensate for the data insufficiency.

Results: To evaluate our models, we conducted two experiments: (a) visualizing driver and passenger mutations, and (b) identifying novel driver mutations using a clustering method. Our visualization showed a clear distinction between passenger mutations and driver mutations. We also found driver mutation candidates and confirmed, based on our literature survey, that these were true driver mutations. The pre-trained mutation vectors and the candidate driver mutations are publicly available at http://infos.korea.ac.kr/mut2vec.

Conclusions: We introduce Mut2Vec, which can be used to generate distributed representations of mutations, and experimentally validate the efficacy of the generated mutation representations. Mut2Vec can be used in various deep learning applications such as cancer classification and drug sensitivity prediction.
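To make the Skip-Gram idea concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that each tumor sample is treated as an unordered "sentence" of mutated genes, so every co-mutated gene in the same sample serves as context. The toy profiles, the function names (`skipgram_pairs`, `train_sgns`), the hyperparameters, and the use of negative sampling are all assumptions made for illustration; the augmentation with literature and protein-protein interaction data is not shown.

```python
import itertools
import numpy as np

# Hypothetical toy data: each inner list is the set of mutated genes in one sample.
profiles = [
    ["TP53", "KRAS", "SMAD4"],
    ["TP53", "KRAS", "CDKN2A"],
    ["BRAF", "NRAS"],
    ["BRAF", "TP53"],
]

def skipgram_pairs(profiles):
    """Return (target, context) pairs; all co-mutated genes in a sample are context."""
    pairs = []
    for genes in profiles:
        # Mutations in a sample are unordered, so the window spans the whole profile.
        pairs.extend(itertools.permutations(sorted(set(genes)), 2))
    return pairs

def train_sgns(profiles, dim=8, negatives=2, epochs=50, lr=0.05, seed=0):
    """Train Skip-Gram with negative sampling (an assumed objective) using plain SGD."""
    rng = np.random.default_rng(seed)
    vocab = sorted({g for p in profiles for g in p})
    index = {g: i for i, g in enumerate(vocab)}
    w_in = rng.normal(scale=0.1, size=(len(vocab), dim))  # mutation vectors
    w_out = np.zeros((len(vocab), dim))                   # context vectors
    pairs = [(index[a], index[b]) for a, b in skipgram_pairs(profiles)]
    for _ in range(epochs):
        for t, c in pairs:
            # One observed context plus uniformly drawn negatives.
            # This sketch does not filter negatives that collide with the positive.
            neg = rng.integers(0, len(vocab), size=negatives)
            targets = np.concatenate(([c], neg))
            labels = np.zeros(len(targets))
            labels[0] = 1.0
            v = w_in[t].copy()
            u = w_out[targets]
            grad = 1.0 / (1.0 + np.exp(-(u @ v))) - labels  # sigmoid(score) - label
            w_in[t] -= lr * (grad @ u)
            # np.add.at accumulates correctly when an index appears more than once.
            np.add.at(w_out, targets, -lr * np.outer(grad, v))
    return vocab, w_in

vocab, vectors = train_sgns(profiles)
```

Each row of `vectors` is a dense representation of the gene at the same position in `vocab`. With real data, the resulting vectors could then be passed to a visualization or clustering step of the kind described in the Results.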
