Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations

Low rank matrix approximation is an important tool in machine learning. Given a data matrix, low rank approximation helps to find factors and patterns, and provides concise representations of the data. Research on low rank approximation usually focuses on real matrices. However, in many applications the data are binary (categorical) rather than continuous. This leads to the problem of low rank approximation of a binary matrix. Here we are given a $d \times n$ binary matrix $A$ and a small integer $k$. The goal is to find two binary matrices $U$ and $V$ of sizes $d \times k$ and $k \times n$ respectively, so that the Frobenius norm of $A - UV$ is minimized. There are two models of this problem, depending on the definition of the dot product of binary vectors: the $\mathrm{GF}(2)$ model and the Boolean semiring model. Unlike low rank approximation of a real matrix, which can be solved efficiently by Singular Value Decomposition, approximation of a binary matrix is NP-hard even for $k=1$. In this paper, we consider the problem of Column Subset Selection (CSS), in which one of the two low rank factors must be formed from $k$ columns of the data matrix. We characterize the approximation ratio of CSS for binary matrices. For the $\mathrm{GF}(2)$ model, we show that the approximation ratio of CSS is bounded by $\frac{k}{2}+1+\frac{k}{2(2^k-1)}$ and that this bound is asymptotically tight. For the Boolean model, it turns out that CSS is no longer sufficient to obtain a bound. We therefore develop a Generalized CSS (GCSS) procedure in which the columns of one of the low rank factors are generated from Boolean formulas operating bitwise on columns of the data matrix. We show that the approximation ratio of GCSS is bounded by $2^{k-1}+1$, and that the exponential dependency on $k$ is inherent.
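To make the two product models and the CSS restriction concrete, the following is a minimal brute-force sketch in Python with NumPy. It is not the paper's algorithm: the function names (`binary_product`, `css_error`, `best_css`) are hypothetical, the factor $U$ is assumed to be the one built from $k$ columns of $A$, and the exhaustive search over all $2^k$ coefficient vectors and all column subsets is only practical for tiny inputs. For binary matrices, the squared Frobenius norm of $A - UV$ equals the number of mismatched entries, which is what the sketch counts.

```python
import itertools
import numpy as np

def binary_product(U, V, model="gf2"):
    """Multiply binary matrices: XOR of ANDs for GF(2), OR of ANDs for Boolean."""
    P = U.astype(int) @ V.astype(int)
    return P % 2 if model == "gf2" else (P > 0).astype(int)

def css_error(A, cols, model="gf2"):
    """Mismatch count when U = A[:, cols] and V is the best binary matrix."""
    U = A[:, cols]
    k = len(cols)
    # All 2^k candidate coefficient vectors, stored as columns (k x 2^k).
    candidates = np.array(list(itertools.product([0, 1], repeat=k))).T
    recon = binary_product(U, candidates, model)  # d x 2^k
    # mismatches[j, c]: entries where column j of A differs from candidate c.
    mismatches = (A[:, :, None] != recon[:, None, :]).sum(axis=0)
    # Each column of V is chosen independently, so take the per-column minimum.
    return int(mismatches.min(axis=1).sum())

def best_css(A, k, model="gf2"):
    """Exhaustive search over all k-subsets of columns (illustration only)."""
    n = A.shape[1]
    return min(
        (css_error(A, list(c), model), c)
        for c in itertools.combinations(range(n), k)
    )

# Toy example: the third column is the XOR of the first two.
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
print(best_css(A, 2, "gf2"))
print(best_css(A, 2, "boolean"))
```

In this toy matrix the third column equals the sum of the first two over $\mathrm{GF}(2)$, so selecting columns 0 and 1 reconstructs $A$ exactly in that model, while the same selection under the Boolean semiring leaves one mismatched entry because OR cannot cancel a shared 1.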
