High Performance Offline and Online Distributed Collaborative Filtering

Big data analytics is a hot research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights that benefit both users and providers in industries such as e-commerce, telecom, finance, and life sciences. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. High-throughput CF and co-clustering on highly sparse, massive datasets, combined with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed collaborative filtering algorithm based on co-clustering. We study both the online and offline variants of this algorithm. A theoretical analysis of the algorithm's time complexity supports the efficacy of our approach. Further, we present the impact of load-balancing optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) and the Yahoo KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm achieves around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy. In online mode, it achieves around 3x better performance than a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
