Bayesian Boolean Matrix Factorisation

Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low-rank binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation, and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real-world and simulated data, our method outperforms all currently existing approaches to Boolean matrix factorisation and completion. It is the first method to provide full posterior inference for Boolean matrix factorisation, which is relevant in applications such as controlling false-positive rates in collaborative filtering and, crucially, improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets, as we demonstrate by analysing single-cell gene expression data from 1.3 million mouse brain cells across 11 thousand genes on commodity hardware.
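
To make the factorisation and the sampler concrete, below is a minimal sketch, not the OrMachine implementation itself. It assumes a simplified symmetric bit-flip noise model with a single parameter `p_noise` and flat Bernoulli(0.5) priors on the factor entries, so each acceptance ratio reduces to a likelihood ratio; the names `boolean_product`, `row_loglik`, and `sweep` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def boolean_product(Z, U):
    """Boolean (OR-of-ANDs) product: x_hat[n, d] = OR_l (Z[n, l] AND U[l, d])."""
    return (Z @ U > 0).astype(np.int8)


def row_loglik(x_row, xhat_row, p_noise):
    """Log-likelihood of one data row when every bit of the Boolean product
    is flipped independently with probability p_noise (assumed noise model)."""
    mismatches = np.count_nonzero(x_row != xhat_row)
    return (x_row.size - mismatches) * np.log(1 - p_noise) + mismatches * np.log(p_noise)


def sweep(X, Z, U, p_noise):
    """One Metropolised Gibbs sweep over the entries of the code matrix Z.

    For a binary variable the Metropolised Gibbs move always proposes the
    flipped value and accepts it with probability min(1, p(flip) / p(current));
    under flat priors this is just the likelihood ratio of the affected row."""
    N, L = Z.shape
    for n in range(N):
        for l in range(L):
            ll_curr = row_loglik(X[n], boolean_product(Z[n:n + 1], U)[0], p_noise)
            Z[n, l] ^= 1  # propose flipping this bit
            ll_prop = row_loglik(X[n], boolean_product(Z[n:n + 1], U)[0], p_noise)
            if np.log(rng.random()) >= ll_prop - ll_curr:
                Z[n, l] ^= 1  # reject the proposal: flip the bit back


# Toy usage: recover L = 3 latent patterns from a noise-free 50 x 30 matrix.
L = 3
Z_true = rng.integers(0, 2, size=(50, L))
U_true = rng.integers(0, 2, size=(L, 30))
X = boolean_product(Z_true, U_true)

Z = rng.integers(0, 2, size=(50, L))
U = rng.integers(0, 2, size=(L, 30))
for _ in range(200):
    sweep(X, Z, U, p_noise=0.01)        # update the codes Z given U
    sweep(X.T, U.T, Z.T, p_noise=0.01)  # update the patterns U given Z (by symmetry)

print("bits misreconstructed:", int(np.sum(X != boolean_product(Z, U))))
```

Because the rows of Z are conditionally independent given U (and vice versa), the per-row updates can be carried out in parallel, which is what makes this style of sampler suitable for large matrices; the always-flip proposal is a standard modification of the discrete-state Gibbs update.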
