Weighted Rank-One Binary Matrix Factorization

Mining discrete patterns in binary data is important for many data analysis tasks, such as data sampling, compression, and clustering. For example, replacing individual records with their patterns can greatly reduce data size and simplify subsequent analysis. As a straightforward approach, rank-one binary matrix approximation has recently been studied actively for mining discrete patterns from binary data. It approximates a binary matrix by the product of one binary presence vector and one binary pattern vector while minimizing the number of mismatched entries. However, this approach suffers from two serious problems. First, if all records are replaced with their respective patterns, noise can account for as much as 50% of the resulting approximate data. This is because the approach deems a pattern present in a record whenever their matching entries outnumber their mismatching entries. Second, the two error types, 1-becoming-0 and 0-becoming-1, are treated equally, although in many application domains they are not equally important. To address these two issues, we propose weighted rank-one binary matrix approximation. It enables a tradeoff between accuracy and succinctness in the approximate data and lets users impose their own preferences on the importance of the different error types. The decision version of the problem, however, is NP-complete, as proved in the paper. To solve it, we provide several mathematical programming formulations, from which 2-approximation algorithms are derived for some special cases. We also present an adaptive tabu search heuristic for the general problem, and our experimental study shows the effectiveness of the heuristic.
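
To make the role of the error weights concrete, here is a minimal sketch in Python. It is not the formulation from the paper: the function names `weighted_error` and `best_presence`, the weight parameters `w10` (cost of a 1 becoming 0) and `w01` (cost of a 0 becoming 1), and the toy matrix are all assumptions made for illustration. The sketch scores a rank-one approximation by its weighted mismatch count and, for a fixed pattern vector, picks the presence vector row by row.

```python
def weighted_error(A, x, y, w10=1.0, w01=1.0):
    """Weighted mismatch count between A and the outer product of x and y.

    w10 is the (assumed) cost of a 1 in A becoming 0 in the approximation,
    w01 the (assumed) cost of a 0 in A becoming 1.
    """
    err = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            approx = x[i] & y[j]
            if a == 1 and approx == 0:
                err += w10
            elif a == 0 and approx == 1:
                err += w01
    return err


def best_presence(A, y, w10=1.0, w01=1.0):
    """For a fixed pattern y, choose each presence bit independently.

    Marking the pattern as present in a row recovers the 1s that the
    pattern covers (saving w10 each) but introduces a 1 wherever the
    row has a 0 under the pattern (costing w01 each).
    """
    x = []
    for row in A:
        covered = sum(1 for a, p in zip(row, y) if a == 1 and p == 1)
        added = sum(1 for a, p in zip(row, y) if a == 0 and p == 1)
        x.append(1 if w10 * covered > w01 * added else 0)
    return x


# Toy data: four records over four attributes (hypothetical).
A = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
y = [1, 1, 1, 0]  # candidate pattern vector

x_equal = best_presence(A, y)                     # both error types cost 1
x_strict = best_presence(A, y, w10=1.0, w01=3.0)  # 0-becoming-1 costs 3

print(x_equal, weighted_error(A, x_equal, y))
print(x_strict, weighted_error(A, x_strict, y, w10=1.0, w01=3.0))
```

With equal weights, the second record is assigned the pattern even though doing so introduces a spurious 1. Once 0-becoming-1 errors are weighted three times as heavily, that record is left out, which illustrates how the weights trade succinctness against accuracy.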
