Parallelizing Stochastic Gradient Descent with Hardware Transactional Memory for Matrix Factorization

The rapid growth of available data necessitates large-scale machine learning methods, and Stochastic Gradient Descent (SGD) has become one of the predominant choices. However, the inherently sequential nature of SGD severely constrains its scalability and prevents it from benefiting from multi-core devices. This work parallelizes SGD with transactional memory, leveraging hardware support for transactional execution to make better use of features newly deployed in commercial multi-core processors. To evaluate the performance of our SGD implementation, we compare it with the traditional lock-based approach and quantitatively analyze its synchronization overhead on real-world datasets. Experimental results show that the proposed parallel SGD implementation achieves satisfactory scalability and improved execution performance compared with the lock-based approach.
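
To make the core idea concrete, the sketch below shows one way to wrap a single SGD factor update for matrix factorization in a hardware transaction using Intel TSX RTM intrinsics (_xbegin/_xend/_xabort from <immintrin.h>), with a spinlock fallback when the transaction aborts. This is a minimal illustration under assumed conventions, not the authors' implementation: the names (sgd_update_htm, do_update, K, fallback_lock), the factor dimension, and the abort-once-then-lock policy are all assumptions.

```c
/* Minimal sketch (not the paper's code): one HTM-protected SGD update
 * for matrix factorization. Compile with: gcc -O2 -mrtm sgd_htm.c    */
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _mm_pause */

#define K 32             /* number of latent factors (assumed) */

static volatile int fallback_lock = 0;   /* coarse lock for the abort path */

static void lock_acquire(void) {
    while (__sync_lock_test_and_set(&fallback_lock, 1))
        while (fallback_lock) _mm_pause();   /* spin until released */
}
static void lock_release(void) { __sync_lock_release(&fallback_lock); }

/* Plain SGD step on rating r: Pu and Qi are the user/item latent factor
 * rows, lr the learning rate, reg the L2 regularization weight.        */
static void do_update(float *Pu, float *Qi, float r, float lr, float reg) {
    float err = r;
    for (int k = 0; k < K; k++)
        err -= Pu[k] * Qi[k];                /* prediction error        */
    for (int k = 0; k < K; k++) {
        float pu = Pu[k], qi = Qi[k];
        Pu[k] += lr * (err * qi - reg * pu);
        Qi[k] += lr * (err * pu - reg * qi);
    }
}

/* Transactional version: the read-modify-write of both factor rows
 * commits atomically. A conflicting update from another thread aborts
 * the transaction, and the update is retried under the fallback lock.  */
static void sgd_update_htm(float *Pu, float *Qi, float r,
                           float lr, float reg) {
    if (_xbegin() == _XBEGIN_STARTED) {
        if (fallback_lock)       /* subscribe to the lock: abort if held */
            _xabort(0xff);
        do_update(Pu, Qi, r, lr, reg);
        _xend();                 /* commit the transaction               */
    } else {
        /* Abort (conflict, capacity, ...): serialize to guarantee
         * progress. Real systems typically retry a few times first.    */
        lock_acquire();
        do_update(Pu, Qi, r, lr, reg);
        lock_release();
    }
}
```

Reading fallback_lock inside the transaction places it in the transaction's read-set, so any thread acquiring the lock forces concurrent transactions to abort; this keeps the transactional and lock-based paths mutually consistent, which is the standard lock-elision pattern for combining HTM with a coarse fallback lock.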
