Model selection for robust learning of mutational signatures using Negative Binomial non-negative matrix factorization

The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures, we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, but they are often overdispersed, and thus the Negative Binomial distribution is more appropriate. We demonstrate using a simulation study that Negative Binomial NMF requires fewer signatures than Poisson NMF to fit the data, and we propose a Negative Binomial NMF with a patient-specific overdispersion parameter to capture the variation across patients. We also introduce a robust model selection procedure inspired by cross-validation to determine the number of signatures. Furthermore, we study the influence of the distributional assumption in relation to two classical model selection procedures: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). In the presence of overdispersion, we show that our model selection procedure is more robust at determining the correct number of signatures than state-of-the-art methods, which overestimate the number of signatures. We apply our proposed analysis to a wide range of simulated data and to a data set from breast cancer patients. The code for our algorithms and analysis is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS.
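The effect of the distributional assumption on AIC and BIC can be illustrated with a minimal sketch in Python. It is not taken from the SigMoS package and it does not perform NMF. It only compares a Poisson and a Negative Binomial fit on one vector of simulated overdispersed counts. The parameterization (mean `mu` and dispersion `alpha`, with variance `mu + mu**2 / alpha`), the simulated values, the grid search over `alpha`, and all function names are assumptions made for illustration.

```python
import math
import numpy as np


def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in counts)


def negbin_loglik(counts, mu, alpha):
    """Log-likelihood of i.i.d. Negative Binomial counts.

    Assumed parameterization: mean mu, dispersion alpha,
    variance mu + mu**2 / alpha (Poisson is the limit alpha -> infinity).
    """
    log_p = math.log(mu / (alpha + mu))
    log_q = math.log(alpha / (alpha + mu))
    return sum(
        math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
        + alpha * log_q + x * log_p
        for x in counts
    )


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    return n_params * math.log(n_obs) - 2 * loglik


# Simulate overdispersed counts as a Gamma-Poisson mixture (hypothetical values).
rng = np.random.default_rng(0)
true_mu, true_alpha, n_obs = 50.0, 5.0, 500
rates = rng.gamma(shape=true_alpha, scale=true_mu / true_alpha, size=n_obs)
counts = rng.poisson(rates).tolist()

# Fit both models: the sample mean estimates mu; alpha is chosen on a coarse grid.
mu_hat = sum(counts) / n_obs
alpha_grid = [float(a) for a in np.logspace(-1, 3, 200)]
alpha_hat = max(alpha_grid, key=lambda a: negbin_loglik(counts, mu_hat, a))

ll_pois = poisson_loglik(counts, mu_hat)
ll_nb = negbin_loglik(counts, mu_hat, alpha_hat)

aic_pois, aic_nb = aic(ll_pois, 1), aic(ll_nb, 2)
bic_pois, bic_nb = bic(ll_pois, 1, n_obs), bic(ll_nb, 2, n_obs)

print(f"Poisson: AIC={aic_pois:.1f} BIC={bic_pois:.1f}")
print(f"NegBin : AIC={aic_nb:.1f} BIC={bic_nb:.1f} (alpha_hat={alpha_hat:.2f})")
```

In this toy setting the extra dispersion parameter costs one additional degree of freedom but absorbs the excess variance. In an NMF setting, a Poisson model could instead try to explain that variance with additional signatures.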
