A Sampling Algorithm to Compute the Set of Feasible Solutions for NonNegative Matrix Factorization with an Arbitrary Rank

Non-negative Matrix Factorization (NMF) is a useful method to extract features from multivariate data, but an important and sometimes neglected concern is that NMF can result in non-unique solutions. Often there exists a Set of Feasible Solutions (SFS), which makes the factorization more difficult to interpret. This problem is particularly neglected in cancer genomics, where NMF is used to infer information about the mutational processes present in the evolution of cancer. In this paper, the extent of non-uniqueness is investigated for two mutational count data sets, and a new sampling algorithm that can find the SFS is introduced. Our sampling algorithm is easy to implement and applies to an arbitrary rank of NMF. This contrasts with the state of the art, where the NMF rank must be smaller than or equal to four. For lower ranks, we show that our algorithm performs similarly to the polygon inflation algorithm developed in relation to chemometrics. Furthermore, we show that the size of the SFS can strongly influence the apparent variability of a solution. Our sampling algorithm is implemented in the R package SFS (https://github.com/ragnhildlaursen/SFS).
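The non-uniqueness described above can be illustrated with a small sketch. It is not the sampling algorithm of the paper or code from the SFS package. The matrices `W`, `H` and the mixing matrix `A` are made-up values chosen for illustration, and the helper names are hypothetical. The sketch shows that if `A` is invertible and both `W A` and `A^{-1} H` stay non-negative, then they form a second valid factorization of the same data matrix.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def is_nonnegative(X):
    """Return True if every entry of the matrix is >= 0."""
    return all(v >= 0 for row in X for v in row)

# Illustrative rank-2 factors (assumed values, not data from the paper).
W = [[F(2), F(1)], [F(1), F(2)], [F(1), F(1)]]
H = [[F(1), F(2), F(3)], [F(3), F(2), F(1)]]

# An invertible 2x2 mixing matrix and its exact inverse.
t = F(1, 5)
A = [[F(1), t], [t, F(1)]]
det = 1 - t * t
A_inv = [[1 / det, -t / det], [-t / det, 1 / det]]

# Transformed factors: a different pair with the same product.
W2 = matmul(W, A)
H2 = matmul(A_inv, H)

V = matmul(W, H)     # data matrix reconstructed from the original factors
V2 = matmul(W2, H2)  # data matrix reconstructed from the transformed factors

print("same product:", V == V2)
print("both factors non-negative:", is_nonnegative(W2) and is_nonnegative(H2))
print("factors differ:", W2 != W)
```

Exact rational arithmetic is used so that the two products can be compared with `==` rather than a floating-point tolerance. Every admissible `A` gives one member of the set of feasible solutions, so a larger set of admissible matrices means a larger SFS.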
