PSL: Exploiting Parallelism, Sparsity and Locality to Accelerate Matrix Factorization on x86 Platforms

Matrix factorization is the basis of many recommendation systems. Although alternating least squares with weighted-\(\lambda \)-regularization (ALS-WR) is widely used for matrix factorization in collaborative filtering, the approach suffers from limited parallel execution and inefficient memory access. We therefore propose PSL, which accelerates the ALS-WR algorithm by exploiting parallelism, sparsity and locality on x86 platforms. PSL can process 20 million ratings, and its multi-threaded speedup reaches up to 14.5\(\times \) on a 20-core machine.
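To make the setting concrete, below is a minimal, illustrative sketch of a plain ALS-WR update in NumPy, not the paper's optimized PSL implementation. It alternately fixes one factor matrix and solves a small regularized least-squares system per user and per item; the function name `als_wr`, the dense zero-means-unrated encoding of `R`, and all parameter defaults are assumptions made for brevity.

```python
import numpy as np

def als_wr(R, k=8, lam=0.05, iters=10, seed=0):
    """Illustrative ALS-WR: R is a dense m x n rating matrix, 0 = unrated."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k)) * 0.1   # user factors
    V = rng.standard_normal((n, k)) * 0.1   # item factors
    mask = R > 0
    for _ in range(iters):
        # Fix V; solve one k x k regularized system per user.
        for u in range(m):
            idx = mask[u]
            n_u = idx.sum()
            if n_u == 0:
                continue
            Vu = V[idx]                                   # items rated by user u
            A = Vu.T @ Vu + lam * n_u * np.eye(k)         # weighted-lambda term
            U[u] = np.linalg.solve(A, Vu.T @ R[u, idx])
        # Fix U; symmetric per-item step.
        for i in range(n):
            idx = mask[:, i]
            n_i = idx.sum()
            if n_i == 0:
                continue
            Ui = U[idx]                                   # users who rated item i
            A = Ui.T @ Ui + lam * n_i * np.eye(k)
            V[i] = np.linalg.solve(A, Ui.T @ R[idx, i])
    return U, V
```

Each per-user (and per-item) solve is independent, which is what makes the algorithm a natural target for the multi-threaded, sparsity-aware and locality-aware optimizations the abstract describes.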
