FastStep: Scalable Boolean Matrix Decomposition

Matrix Decomposition methods are applied to a wide range of tasks, such as data denoising, dimensionality reduction, co-clustering and community detection. However, in the presence of boolean inputs, common methods either do not scale or do not provide a boolean reconstruction, which results in high reconstruction error and low interpretability of the decomposition. We propose a novel step decomposition of boolean matrices in non-negative factors with boolean reconstruction. By formulating the problem using threshold operators and through suitable relaxation of this problem, we provide a scalable algorithm that can be applied to boolean matrices with millions of non-zero entries. We show that our method achieves significantly lower reconstruction error when compared to standard state of the art algorithms. We also show that the decomposition keeps its interpretability by analyzing communities in a flights dataset (where the matrix is interpreted as a graph in which nodes are airports) and in a movie-ratings dataset with 10 million non-zeros.

[1]  C. Eckart,et al.  The approximation of one matrix by another of lower rank , 1936 .

[2]  H. Akaike A new look at the statistical model identification , 1974 .

[3]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[4]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[5]  Lawrence K. Saul,et al.  A Generalized Linear Model for Principal Component Analysis of Binary Data , 2003, AISTATS.

[6]  Inderjit S. Dhillon,et al.  Information-theoretic co-clustering , 2003, KDD '03.

[7]  Tao Li,et al.  A general model for clustering binary data , 2005, KDD '05.

[8]  Roded Sharan,et al.  Biclustering Algorithms: A Survey , 2007 .

[9]  Jimeng Sun,et al.  Less is More: Compact Matrix Decomposition for Large Sparse Graphs , 2007, SDM.

[10]  Gene H. Golub,et al.  Calculating the singular values and pseudo-inverse of a matrix , 2007, Milestones in Matrix Computation.

[11]  Yehuda Koren,et al.  Lessons from the Netflix prize challenge , 2007, SKDD.

[12]  Chris H. Q. Ding,et al.  Binary matrix factorization for analyzing gene expression data , 2009, Data Mining and Knowledge Discovery.

[13]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[14]  Jorma Rissanen,et al.  Minimum Description Length Principle , 2010, Encyclopedia of Machine Learning.

[15]  Christos Faloutsos,et al.  Kronecker Graphs: An Approach to Modeling Networks , 2008, J. Mach. Learn. Res..

[16]  Anastasios Kyrillidis,et al.  Improving Co-Cluster Quality with Application to Product Recommendations , 2014, CIKM.

[17]  Christos Faloutsos,et al.  Beyond Blocks: Hyperbolic Community Detection , 2014, ECML/PKDD.