Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data

We present a new non-negative matrix factorization model for (0, 1) bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is general enough to be adapted to many other domains where latent representations of (0, 1) bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance over state-of-the-art methods in bioinformatics on both real and synthetic DNA methylation datasets. In addition, it yields meaningful latent representations that accord with existing biological knowledge.
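To make the DNCB distribution concrete, the following minimal sketch draws samples using its standard representation as a Poisson mixture of beta distributions: two independent Poisson counts are added to the two beta shape parameters. The function name `sample_dncb`, the argument names, and the convention that `lam1` and `lam2` are used directly as Poisson rates are assumptions made for illustration (some references define the non-centrality parameters so that the Poisson rates are `lam / 2`); this is not the inference algorithm of the paper.

```python
import numpy as np

def sample_dncb(a, b, lam1, lam2, size, rng):
    """Sample a doubly non-central beta variable as a Poisson mixture of betas.

    Assumed parameterization (illustrative): shapes a, b > 0 and
    non-centrality parameters lam1, lam2 >= 0 used directly as Poisson rates.
    """
    m1 = rng.poisson(lam1, size=size)  # latent count added to the first shape
    m2 = rng.poisson(lam2, size=size)  # latent count added to the second shape
    return rng.beta(a + m1, b + m2)    # conditionally beta given the counts

rng = np.random.default_rng(0)
x = sample_dncb(a=1.0, b=1.0, lam1=3.0, lam2=3.0, size=10_000, rng=rng)
x_central = sample_dncb(a=1.0, b=1.0, lam1=0.0, lam2=0.0, size=10_000, rng=rng)
```

Setting both non-centrality parameters to zero forces both counts to zero, so `x_central` is an ordinary Beta(a, b) sample; this is the sense in which the DNCB distribution generalizes the beta distribution.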
