Clustering Boolean tensors

Graphs that evolve over time, such as friendship networks, are an example of data that are naturally represented as binary tensors. Just as we can analyse the adjacency matrix of a graph with a matrix factorization, we can analyse the tensor by factorizing it. Unfortunately, tensor factorizations are computationally hard problems and are often significantly harder than their matrix counterparts. In the case of Boolean tensor factorizations, where the input tensor and all the factors are required to be binary and we use Boolean algebra, much of that hardness comes from the possibility of overlapping components. Yet in many applications we are perfectly happy to partition at least one of the modes. For instance, in the aforementioned time-evolving friendship networks, groups of friends might overlap, but the time points at which the network was captured are always distinct. In this paper we investigate what consequences this partitioning has for the computational complexity of Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithm with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data on both synthetic and real-world data sets.
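To make the setting concrete, the following is a minimal sketch of a Boolean tensor factorization in which one mode is partitioned. It is not the algorithm from the paper: it only reconstructs a tensor from given binary factors and counts matching entries. The names `boolean_cp`, `is_partition`, and `similarity`, the CP-style form of the factorization, and the toy factor matrices are assumptions made for illustration.

```python
def boolean_cp(A, B, C):
    """Boolean reconstruction from binary factor matrices (assumed CP-style form):
    X[i][j][k] = OR over r of (A[i][r] AND B[j][r] AND C[k][r])."""
    rank = len(A[0])
    return [[[int(any(A[i][r] and B[j][r] and C[k][r] for r in range(rank)))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]


def is_partition(C):
    """True if every row of C has exactly one 1, i.e. each index of that
    mode belongs to exactly one cluster."""
    return all(sum(row) == 1 for row in C)


def similarity(X, Y):
    """Number of entries on which two equally sized binary tensors agree."""
    return sum(x == y
               for X_i, Y_i in zip(X, Y)
               for X_ij, Y_ij in zip(X_i, Y_i)
               for x, y in zip(X_ij, Y_ij))


# Hypothetical toy data: 3 people, 2 overlapping friend groups, 4 time points.
A = [[1, 0], [1, 1], [0, 1]]          # person 1 belongs to both groups
B = [[1, 0], [1, 1], [0, 1]]          # same groups on the second mode
C = [[1, 0], [1, 0], [0, 1], [0, 1]]  # each time point is in exactly one cluster

X = boolean_cp(A, B, C)

# Flip one entry to imitate a noisy observation of X.
noisy = [[row[:] for row in plane] for plane in X]
noisy[0][0][0] = 1 - noisy[0][0][0]

total = len(A) * len(B) * len(C)
sim = similarity(noisy, X)
print(is_partition(C), sim, total - sim)
```

Because `C` is a partition, every time slice of `X` is the rank-1 binary matrix of the one cluster that time point belongs to, which is where rank-1 binary factorizations enter the picture. Similarity counts agreeing entries and dissimilarity counts disagreeing ones, so the two always sum to the number of entries in the tensor.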
