Succinct Representations in Collaborative Filtering: A Case Study using Wavelet Tree on 1,000 Cores

The User-Item (U-I) matrix has been the dominant data infrastructure of Collaborative Filtering (CF). Data sparsity and the growing need to accommodate side information in CF design inflate its space consumption, both at runtime and in storage, so reducing that cost requires going beyond the U-I matrix. In this paper, we present a case study of succinct representations in Collaborative Filtering as an alternative to the U-I matrix. Our key insight is to introduce Succinct Data Structures as a new infrastructure for CF. To this end, we implemented a user-based K-Nearest-Neighbor CF prototype on top of a Wavelet Tree. We first designed Accessible Compressed Documents (ACD), a representation that compresses U-I data in a Wavelet Tree and is efficient in both storage and runtime. We then showed that ACD's characteristics can be exploited to build an efficient intersection algorithm that works without decompression. We evaluated our design on 1,000 cores of the Tianhe-II supercomputer with ml-20m, one of the largest public data sets. The results showed that our prototype delivered its results in 3.7 minutes on average.
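To make the idea of querying U-I data through a Wavelet Tree concrete, the following is a minimal Python sketch and not the ACD design or the intersection algorithm of the paper, whose details the abstract does not give. It assumes that each user's item IDs are concatenated into one sequence with a per-user offset table, and it stores plain prefix counts where a real succinct implementation would use compressed bitvectors with rank support. The names `WaveletTree`, `has_item`, `common_items`, and the toy `ratings` data are hypothetical.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a non-empty sequence of non-negative ints.

    Assumption: prefix counts are kept in plain Python lists for clarity;
    a succinct implementation would use bitvectors with rank support.
    """

    def __init__(self, seq, lo=None, hi=None):
        seq = list(seq)
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return  # leaf (single symbol) or empty subtree
        mid = (self.lo + self.hi) // 2
        # prefix[i] = how many of seq[:i] are routed to the left child
        self.prefix = [0]
        for s in seq:
            self.prefix.append(self.prefix[-1] + (s <= mid))
        self.left = WaveletTree([s for s in seq if s <= mid], self.lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, self.hi)

    def access(self, i):
        """Return seq[i] without storing seq itself."""
        node = self
        while node.left is not None:
            if node.prefix[i + 1] - node.prefix[i]:  # symbol went left
                i, node = node.prefix[i], node.left
            else:
                i, node = i - node.prefix[i], node.right
        return node.lo

    def rank(self, c, i):
        """Return the number of occurrences of c in seq[:i]."""
        node = self
        if c < node.lo or c > node.hi:
            return 0
        while node.left is not None:
            if c <= (node.lo + node.hi) // 2:
                i, node = node.prefix[i], node.left
            else:
                i, node = i - node.prefix[i], node.right
        return i


# Toy U-I data (hypothetical): user id -> item ids the user rated.
ratings = {0: [1, 3, 4], 1: [3, 4, 6], 2: [2, 5]}

# Concatenate all users' items; start[u]..start[u+1] is user u's slice.
seq, start = [], [0]
for u in sorted(ratings):
    seq.extend(ratings[u])
    start.append(len(seq))
wt = WaveletTree(seq)


def has_item(u, item):
    """Membership test via two rank queries, no decompression of the slice."""
    return wt.rank(item, start[u + 1]) - wt.rank(item, start[u]) > 0


def common_items(u, v):
    """Naive intersection: read u's items with access, probe v with rank."""
    items = (wt.access(i) for i in range(start[u], start[u + 1]))
    return [item for item in items if has_item(v, item)]
```

In this sketch, `common_items(0, 1)` yields the items that users 0 and 1 both rated, which is the quantity a user-based K-Nearest-Neighbor similarity would be computed from. Each query costs time proportional to the tree height, that is, logarithmic in the range of item IDs.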
