Scalable and Explainable 1-Bit Matrix Completion via Graph Signal Learning

Abstract

One-bit matrix completion is an important class of positive-unlabeled (PU) learning problems in which the observations consist of only positive examples, e.g., in top-N recommender systems. For the first time, we show that 1-bit matrix completion can be formulated as the problem of recovering clean graph signals from noise-corrupted signals on hypergraphs. This makes it possible to leverage recent advances in graph signal learning. We then propose the spectral graph matrix completion (SGMC) method, which can recover the underlying matrix in distributed systems by filtering the noisy data in the graph frequency domain, and which can provide micro- and macro-level explanations via vertex-frequency analysis. To tackle the computational and memory issues of performing graph signal operations on large graphs, we construct a scalable Nyström algorithm that efficiently computes orthonormal eigenvectors. Furthermore, we develop polynomial and sparse frequency filters to remedy the accuracy loss caused by the approximations. We demonstrate the effectiveness of our algorithms on top-N recommendation tasks; results on three large-scale real-world datasets show that SGMC outperforms state-of-the-art top-N recommendation algorithms in accuracy while requiring only a small fraction of the baselines' training time.

Introduction

This paper considers the problem of recovering a 0-1 matrix M ∈ {0, 1}^{N×M} only from positive and unlabeled data, which is also referred to as 1-bit matrix completion (Cai and Zhou 2013; Davenport et al. 2014; Hsieh, Natarajan, and Dhillon 2015). We assume that the positive samples are randomly drawn from {(i, u) | M_{i,u} = 1} with probability p(M_{i,u} = 1); more precisely, we observe a subset Ω used for training in the presence of class-conditional random label noise that flips a 1 to 0 with probability ρ. The unlabeled data is therefore a mixture of unobserved positive examples and true negative examples, which makes formulating the underlying optimization problem challenging. Recent works (Jahrer and Töscher 2012; Park et al. 2015; He et al. 2016; Li et al. 2016; Wu, Hsieh, and Sharpnack 2017, 2018) have shown that treating all unlabeled examples as negative examples in supervised learning can achieve decent performance in practice, although the learned models can be biased (Kiryo et al. 2017). However, treating all unlabeled examples as negative incurs high computational overhead because all N × M entries of M must be considered in training (Hu, Koren, and Volinsky 2008; Mackey, Jordan, and Talwalkar 2011; Lee et al. 2013; Chen et al. 2015), which is prohibitive for applications with large-scale matrices, e.g., recommendation over millions of songs (Bertin-Mahieux et al. 2011). Another issue of existing 1-bit matrix completion methods is their lack of explainability (Abdollahi and Nasraoui 2016; Zhang and Chen 2018), due either to the black-box nature of neural networks (Wang, Wang, and Yeung 2015; Lian et al. 2018; Liang et al. 2018; Zhou et al. 2018) or to the connection between past actions and future predictions being broken by latent variables or other transformations (Cao et al. 2007; Rendle et al. 2009; Rendle 2010; Zheng et al. 2018).
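The observation model above admits a compact simulation. The following Python sketch (illustrative sizes and names, not the paper's code) draws a synthetic ground-truth matrix and applies the class-conditional flip with probability ρ; the resulting zeros are a mixture of true negatives and unobserved positives, which is exactly what makes naive negative labeling biased.

```python
# A minimal sketch of the 1-bit observation model; sizes, the positive rate,
# and all variable names are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, num_users, rho = 1000, 500, 0.4          # items, users, flip probability

# Ground-truth 0-1 matrix M: entries are 1 with a small positive rate.
M = (rng.random((N, num_users)) < 0.05).astype(np.int8)

# Class-conditional label noise: each 1 is flipped to 0 with probability rho,
# so the observed matrix R keeps only a random subset of the positives.
flipped = rng.random((N, num_users)) < rho
R = np.where(flipped, 0, M)

# The zeros of R mix true negatives with unobserved positives (PU data).
print("true positives:", int(M.sum()), "| observed positives:", int(R.sum()))
```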
To this end, we propose a scalable and explainable 1-bit matrix completion algorithm, namely spectral graph matrix completion (SGMC), in which the underlying matrix M is recovered by performing parallel signal processing on hypergraphs to improve system scalability. The signals learned from the hypergraphs can be used to explain the models. To illustrate the idea, consider the problem of top-N recommendation, in which we model a user's historical records as a signal r_u on a prespecified item-item hypergraph G, where the value r_u(i) at the i-th vertex indicates whether the u-th user likes the i-th item. We can then pose the problem as recovering the underlying clean graph signal m_u from a noisy signal r_u, which enables flexible signal processing techniques such as the graph Fourier transform and vertex-frequency analysis (Shuman et al. 2013; Shuman, Ricaud, and Vandergheynst 2016).

Motivation. The graph signal processing perspective on 1-bit matrix completion has rarely been studied, yet it is appealing for several reasons. First, we are no longer limited to geometrical proximity in the graph vertex domain, but can identify and exploit structure in the graph frequency domain for potential performance improvement. Second, it enables us to take advantage of well-developed vertex-frequency analysis to provide explanations behind the predictions at both the micro-level and the macro-level.

The main contributions of this paper are as follows:
• To the best of our knowledge, this is the first work to develop a graph signal processing formulation of the 1-bit matrix completion problem, and we justify quantitatively as well as qualitatively why a graph signal processing perspective is effective for this problem.
• We propose a scalable and explainable algorithm, named spectral graph matrix completion, which minimizes unbiased risk estimators. Benefiting from spectral signals on graphs, our approach can provide micro- and macro-level explanations. This is one of the key differences from conventional solutions.
• We construct a scalable Nyström algorithm to compute orthonormal eigenvectors. Overall, our SGMC model consumes O(N(K + Lη) + LK) time and O(ηN(L + M)) memory, where N ≫ L > K and N ≫ η.
• Top-N recommendation results on large datasets show that SGMC achieves state-of-the-art ranking accuracy, provides reasonable explanations, and requires only a small fraction of the training time of the best-performing baseline.

Related Work

One-bit matrix completion approaches either regard unlabeled data as negative data with smaller weights (Pan et al. 2008; Hsieh, Natarajan, and Dhillon 2015; Li et al. 2016; He et al. 2016), or treat unlabeled data as simultaneously weighted positive and negative (Natarajan et al. 2013; Du Plessis, Niu, and Sugiyama 2014, 2015; Kiryo et al. 2017). The former (Hu, Koren, and Volinsky 2008; Jahrer and Töscher 2012; He et al. 2016) relies heavily on good choices of weights for the unlabeled data, which are computationally expensive to tune. In practice, because the unlabeled dataset consists of both positive and negative data, this family of algorithms (Cao et al. 2007; Rendle et al. 2009; Park et al. 2015; Wu, Hsieh, and Sharpnack 2017, 2018) has a systematic estimation bias (Du Plessis, Niu, and Sugiyama 2014, 2015). By contrast, the latter focuses on unbiased risk estimators to avoid tuning the weights.
However, most existing works exhibit poor scalability due to the high computational complexity of operating on very large matrices (Mackey, Jordan, and Talwalkar 2011; Lee et al. 2013; Chen et al. 2015). We propose a composite loss (Eq. (5)) that cancels the bias, and by applying sparsity and orthonormality constraints (Eq. (14)) our approach can offer different levels of explanation behind the predictions. To scale to very large datasets, we also devise a parallel matrix approximation method that guarantees the orthogonality of its outputs. This distinguishes our work from prior research. More related work is discussed in the supplementary materials.

The Model

Throughout this paper, we denote scalars by either lowercase or uppercase letters, vectors by boldface lowercase letters, and matrices by boldface uppercase letters. Unless otherwise specified, all vectors are column vectors. In addition, we adopt the following definitions:

Definition 1. (Hypergraph). An undirected and connected hypergraph G = {I, U} consists of a finite set of vertices (items) I with |I| = N and a set of hyperedges (users) U with |U| = M. Each hyperedge is a subset of I such that ∪_{u∈U} u = I, and a hyperedge u ∈ U containing only two vertices is a simple graph edge.

Definition 2. (Incidence Matrix). Given a hypergraph G = {I, U}, we say a hyperedge u is incident with a vertex i when i ∈ u. The hypergraph G can then be represented by an N-by-M incidence matrix (item-user implicit feedback matrix) R defined as follows:

R_{i,u} = 1 if i ∈ u, and 0 otherwise. (1)

Definition 3. (Hypergraph Laplacian Matrix). Given a hypergraph G = {I, U}, we denote the degree of a vertex i by d(i) = Σ_{u∈U} R_{i,u} and the degree of a hyperedge u by δ(u) = Σ_{i∈I} R_{i,u}. Let D_v and D_e denote the diagonal matrices containing the vertex and hyperedge degrees, respectively. Then the hypergraph Laplacian matrix Ł is defined as follows:

Ł = I − D_v^{−1/2} R D_e^{−1} R^⊤ D_v^{−1/2}. (2)

Definition 4. (Graph Signal). For a hypergraph G = {I, U}, the data at the vertices of the graph is referred to as a graph signal, represented as a vector r ∈ ℝ^N. The implicit feedback of the u-th user can be viewed as a signal r_u on G, where the i-th entry of r_u equals R_{i,u}, namely r_u(i) = R_{i,u}.

Graph Fourier Transform

The classical Fourier transform is defined as an expansion of a function f with respect to complex exponentials:

f̂(ξ) = ⟨f, e^{2πiξt}⟩ = ∫_ℝ f(t) e^{−2πiξt} dt.
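To ground Definitions 2-4, the sketch below builds Eq. (2) from a toy incidence matrix and applies the standard graph Fourier transform, i.e., expansion in the Laplacian's orthonormal eigenvectors (Shuman et al. 2013). It is illustrative only: SGMC computes the eigenvectors via its scalable Nyström algorithm rather than the dense eigendecomposition used here, and the low-pass cutoff is arbitrary.

```python
# A sketch of Eq. (2) and the graph Fourier transform of a user signal.
# Toy data and names are illustrative; SGMC replaces the dense
# eigendecomposition below with a scalable Nyström approximation.
import numpy as np

def hypergraph_laplacian(R):
    """Eq. (2): Ł = I − D_v^{-1/2} R D_e^{-1} R^T D_v^{-1/2}."""
    d_v = np.maximum(R.sum(axis=1), 1.0)    # vertex (item) degrees d(i)
    d_e = np.maximum(R.sum(axis=0), 1.0)    # hyperedge (user) degrees δ(u)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
    De_inv = np.diag(1.0 / d_e)
    return np.eye(R.shape[0]) - Dv_inv_sqrt @ R @ De_inv @ R.T @ Dv_inv_sqrt

rng = np.random.default_rng(0)
R = (rng.random((200, 80)) < 0.1).astype(float)   # toy item-user incidence matrix

L = hypergraph_laplacian(R)
lam, U = np.linalg.eigh(L)     # graph frequencies and orthonormal eigenvectors (ascending)

r_u = R[:, 0]                  # noisy signal of user u = 0 on the item hypergraph
r_hat = U.T @ r_u              # graph Fourier transform: frequency-domain coefficients
m_u = U[:, :16] @ r_hat[:16]   # low-pass reconstruction: an estimate of the clean signal
```

Filtering r̂_u in the frequency domain before transforming back is the core denoising operation; the polynomial and sparse frequency filters mentioned in the abstract refine this step.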

References

[1] Li Shang et al. Low-Rank Matrix Approximation with Stability. ICML, 2016.

[2] Xu Chen et al. Explainable Recommendation: A Survey and New Perspectives. Found. Trends Inf. Retr., 2018.

[3] Gang Niu et al. Convex Formulation for Learning from Positive and Unlabeled Data. ICML, 2015.

[4] Olfa Nasraoui et al. Explainable Matrix Factorization for Collaborative Filtering. WWW, 2016.

[5] Qiang Yang et al. One-Class Collaborative Filtering. ICDM, 2008.

[6] George Karypis et al. SLIM: Sparse Linear Methods for Top-N Recommender Systems. ICDM, 2011.

[7] Nagarajan Natarajan et al. PU Learning for Matrix Completion. ICML, 2014.

[8] Yifan Hu et al. Collaborative Filtering for Implicit Feedback Datasets. ICDM, 2008.

[9] Tat-Seng Chua et al. Neural Graph Collaborative Filtering. SIGIR, 2019.

[10] Wen-Xin Zhou et al. A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion. J. Mach. Learn. Res., 2013.

[11] Harald Steck et al. Markov Random Fields for Collaborative Filtering. NeurIPS, 2019.

[12] Xing Xie et al. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. KDD, 2018.

[13] H. Zou et al. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. B, 2005.

[14] George Karypis et al. Item-Based Top-N Recommendation Algorithms. TOIS, 2004.

[15] Chao Yang et al. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.

[16] Tie-Yan Liu et al. Learning to Rank: From Pairwise Approach to Listwise Approach. ICML, 2007.

[17] Nagarajan Natarajan et al. Learning with Noisy Labels. NIPS, 2013.

[18] Li Shang et al. WEMAREC: Accurate and Scalable Recommendation through Weighted and Ensemble Matrix Approximation. SIGIR, 2015.

[19] Edward Y. Chang et al. Parallel Spectral Clustering in Distributed Systems. IEEE Trans. Pattern Anal. Mach. Intell., 2011.

[20] Tun Lu et al. Mixture Matrix Approximation for Collaborative Filtering. 2019.

[21] Cho-Jui Hsieh et al. SQL-Rank: A Listwise Approach to Collaborative Ranking. ICML, 2018.

[22] Ameet Talwalkar et al. Divide-and-Conquer Matrix Factorization. NIPS, 2011.

[23] Wei Liu et al. Mixture-Rank Matrix Approximation for Collaborative Filtering. NIPS, 2017.

[24] Michael Jahrer et al. Collaborative Filtering Ensemble for Ranking. KDD Cup, 2012.

[25] F. Maxwell Harper et al. The MovieLens Datasets: History and Context. TiiS, 2016.

[26] Guorui Zhou et al. Deep Interest Network for Click-Through Rate Prediction. KDD, 2017.

[27] Jitendra Malik et al. Normalized Cuts and Image Segmentation. CVPR, 1997.

[28] Pierre Vandergheynst et al. Vertex-Frequency Analysis on Graphs. arXiv, 2013.

[29] Harald Steck et al. Item Popularity and Recommendation Accuracy. RecSys, 2011.

[30] Ewout van den Berg et al. 1-Bit Matrix Completion. arXiv, 2012.

[31] Jin Zhang et al. Preference Completion: Large-Scale Collaborative Ranking from Pairwise Comparisons. ICML, 2015.

[32] Dit-Yan Yeung et al. Collaborative Deep Learning for Recommender Systems. KDD, 2014.

[33] Gang Niu et al. Analysis of Learning from Positive and Unlabeled Data. NIPS, 2014.

[34] Thierry Bertin-Mahieux et al. The Million Song Dataset. ISMIR, 2011.

[35] Lei Zheng et al. Spectral Collaborative Filtering. RecSys, 2018.

[36] Junchi Yan et al. Modeling Dynamic User Preference via Dictionary Learning for Sequential Recommendation. IEEE Trans. Knowl. Data Eng., 2022.

[37] Steffen Rendle et al. Factorization Machines. ICDM, 2010.

[38] Nathan Halko et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Rev., 2009.

[39] Geoffrey E. Hinton et al. Visualizing Data Using t-SNE. J. Mach. Learn. Res., 2008.

[40] Tat-Seng Chua et al. Neural Collaborative Filtering. WWW, 2017.

[41] Yoram Singer et al. Local Low-Rank Matrix Approximation. ICML, 2013.

[42] Ameet Talwalkar et al. Ensemble Nyström Method. NIPS, 2009.

[43] Tat-Seng Chua et al. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. SIGIR, 2016.

[44] James T. Kwok et al. Making Large-Scale Nyström Approximation Possible. ICML, 2010.

[45] Matthew D. Hoffman et al. Variational Autoencoders for Collaborative Filtering. WWW, 2018.

[46] Jitendra Malik et al. Spectral Grouping Using the Nyström Method. IEEE Trans. Pattern Anal. Mach. Intell., 2004.

[47] Lars Schmidt-Thieme et al. BPR: Bayesian Personalized Ranking from Implicit Feedback. UAI, 2009.

[48] Gang Niu et al. Positive-Unlabeled Learning with Non-Negative Risk Estimator. NIPS, 2017.

[49] James Bennett et al. The Netflix Prize. 2007.

[50] Li Shang et al. MPMA: Mixture Probabilistic Matrix Approximation for Collaborative Filtering. IJCAI, 2016.

[51] Pascal Frossard et al. The Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to Networks and Other Irregular Domains. IEEE Signal Process. Mag., 2012.

[52] Cho-Jui Hsieh et al. Large-Scale Collaborative Ranking in Near-Linear Time. KDD, 2017.