A Simple Model-Based Approach to Inferring and Visualizing Cancer Mutation Signatures

Recent advances in sequencing technologies have enabled the production of massive amounts of data on somatic mutations from cancer genomes. These data have led to the detection of characteristic patterns of somatic mutations, or “mutation signatures”, at an unprecedented resolution, with the potential for new insights into the causes and mechanisms of tumorigenesis. Here we present new methods for modelling, identifying and visualizing such mutation signatures. Our methods greatly simplify mutation signature models compared with existing approaches, reducing the number of parameters by orders of magnitude even while increasing the number of contextual factors (e.g., the number of flanking bases) that are accounted for. This improves both the sensitivity and the robustness of inferred signatures. We also provide a new, intuitive way to visualize the signatures, analogous to the use of sequence logos to visualize transcription factor binding sites. We illustrate our new methods on somatic mutation data from urothelial carcinoma of the upper urinary tract, and on a larger dataset from 30 diverse cancer types. The results illustrate several important features of our methods, including the ability of our new visualization tool to clearly highlight the key features of each signature, the improved robustness of signature inferences from small sample sizes, and more detailed inference of signature characteristics such as strand biases and sequence context effects at the base two positions 5′ of the mutated site. The overall framework of our work is based on probabilistic models that are closely connected with “mixed-membership models”, which are widely used in population genetic admixture analysis and in machine learning for document clustering. We argue that recognizing these relationships should help improve understanding of mutation signature extraction problems, and should suggest ways to further improve the statistical methods.
Our methods are implemented in an R package pmsignature (https://github.com/friend1ws/pmsignature) and a web application available at https://friend1ws.shinyapps.io/pmsignature_shiny/.
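
To make the claimed reduction in parameters concrete, the following Python sketch counts free parameters per signature under two parameterizations. It rests on assumptions that the abstract does not spell out: that the simplification comes from treating each mutation feature (substitution type, each flanking base, strand) as independent within a signature, and that the six standard substitution types are used after collapsing complementary strands. The function names are hypothetical and are not part of pmsignature.

```python
# Illustrative parameter counting only; not the pmsignature implementation.
# Assumption: the simplified model treats mutation features as independent
# within a signature. Function names here are hypothetical.

N_SUBSTITUTIONS = 6  # C>A, C>G, C>T, T>A, T>C, T>G after strand collapsing
N_BASES = 4          # A, C, G, T


def full_model_params(n_flank, strand=False):
    """Free parameters when every feature combination has its own probability.

    n_flank is the number of flanking bases considered on EACH side.
    """
    n_categories = N_SUBSTITUTIONS * N_BASES ** (2 * n_flank)
    if strand:
        n_categories *= 2
    return n_categories - 1  # probabilities sum to one


def independent_model_params(n_flank, strand=False):
    """Free parameters when each feature has its own small distribution."""
    n_params = (N_SUBSTITUTIONS - 1) + 2 * n_flank * (N_BASES - 1)
    if strand:
        n_params += 1  # one free parameter for a two-valued strand feature
    return n_params


if __name__ == "__main__":
    for n_flank in (1, 2, 3):
        print(n_flank,
              full_model_params(n_flank, strand=True),
              independent_model_params(n_flank, strand=True))
```

Under these assumptions, the full parameterization grows exponentially with the number of flanking bases, whereas the independent parameterization grows only linearly, which is why adding more context need not inflate the model.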
