Ternary Matrix Factorization: problem definitions and algorithms

Can we learn from the unknown? Logical data sets of the ternary kind are often found in information systems. They contain unknown as well as true/false values. An unknown value may represent a missing entry (lost or indeterminable) or may carry meaning of its own, like a "Don't Know" response in a questionnaire. In this paper, we introduce algorithms for reducing the dimensionality of logical data (and of categorical data in general) in the context of a new data mining challenge: Ternary Matrix Factorization (TMF). For a ternary data matrix, TMF exploits ternary logic to produce a basis matrix, which holds the major patterns in the data, and a usage matrix, which maps patterns to original observations. Both matrices are interpretable, and their ternary matrix product approximates the original matrix. TMF has applications in (1) finding targeted structure in ternary data, (2) imputing values through pattern discovery in highly incomplete categorical data sets, and (3) solving instances of its encapsulated Binary Matrix Factorization problem. Our algorithm FasTer (FASt TERnary Matrix Factorization) has run-time complexity that is linear in the dimensions of the data set, and it is parameter-robust. A variant of FasTer that exploits results from combinatorics provides accuracy bounds for a core part of the algorithm in certain situations. Experiments on synthetic and real-world data sets show that our algorithms can outperform state-of-the-art techniques in all three TMF applications with respect to run-time and effectiveness. Finally, speedup and efficiency results on a parallel version of FasTer demonstrate its suitability for weak- and strong-scaling scenarios.
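To make the idea of a ternary matrix product concrete, the following is a minimal sketch, not the FasTer algorithm itself. It assumes Kleene's strong three-valued logic (AND as minimum, OR as maximum under the ordering false < unknown < true), an integer encoding of the three values, and a "usage times basis" orientation with observations as rows; the abstract does not fix any of these choices, and the names `ternary_product` and `mismatches` are hypothetical.

```python
# Assumed encoding for this sketch: 0 = false, 1 = unknown, 2 = true.
F, U, T = 0, 1, 2


def ternary_product(usage, basis):
    """Entry (i, j) is OR over k of (usage[i][k] AND basis[k][j]).

    Under the assumed Kleene logic, AND is min and OR is max
    with respect to the ordering F < U < T.
    """
    n, r, m = len(usage), len(basis), len(basis[0])
    return [
        [max(min(usage[i][k], basis[k][j]) for k in range(r)) for j in range(m)]
        for i in range(n)
    ]


def mismatches(data, approx):
    """Count the cells where the approximation differs from the data."""
    return sum(
        d != a
        for row_d, row_a in zip(data, approx)
        for d, a in zip(row_d, row_a)
    )


# Two patterns (rows of the basis matrix) over four attributes.
basis = [
    [T, T, F, U],
    [F, U, T, T],
]
# Three observations; each row says which patterns the observation uses.
usage = [
    [T, F],
    [F, T],
    [T, T],
]
# Toy data matrix that the factorization should approximate.
data = [
    [T, T, F, U],
    [F, U, T, T],
    [T, T, T, F],
]

approx = ternary_product(usage, basis)
error = mismatches(data, approx)
```

In this toy instance the first two observations are reproduced exactly, while the third combines both patterns and differs from the data in one cell. A mismatch count is only one possible error measure; the paper's actual objective may differ.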
