ALX: Large Scale Matrix Factorization on TPUs

We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design makes efficient use of the TPU architecture and scales to matrix factorization problems with O(B) (billions of) rows/columns by increasing the number of available TPU cores. To spur future research on large-scale matrix factorization methods and to illustrate the scalability of our implementation, we also built a real-world web link prediction dataset called WebGraph, which can be easily modeled as a matrix factorization problem. We created several variants of this dataset based on the locality and sparsity properties of sub-graphs. The largest variant of WebGraph has around 365M nodes, and training a single epoch on it finishes in about 20 minutes with 256 TPU cores. We include speed and performance numbers of ALX on all variants of WebGraph. Both the framework code and the dataset will be open-sourced.
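For readers unfamiliar with Alternating Least Squares, the following is a minimal single-host sketch of the general technique in NumPy. It is not ALX's implementation: it uses a small dense matrix with an observation mask, has no sharding or TPU-specific logic, and all names (`als_step`, `als`, `objective`, `reg`, `dim`) are hypothetical choices made for illustration.

```python
import numpy as np


def als_step(fixed, ratings, mask, reg):
    """Re-solve every row embedding while the other side (`fixed`) is held constant."""
    dim = fixed.shape[1]
    updated = np.zeros((ratings.shape[0], dim))
    for i in range(ratings.shape[0]):
        obs = mask[i]                          # boolean: observed columns in row i
        h = fixed[obs]                         # embeddings of the observed columns
        a = h.T @ h + reg * np.eye(dim)        # regularized normal-equation matrix
        b = h.T @ ratings[i, obs]
        updated[i] = np.linalg.solve(a, b)     # closed-form least-squares solution
    return updated


def als(ratings, mask, dim=2, reg=0.1, epochs=20, seed=0):
    """Alternate between solving for row and column embeddings."""
    rng = np.random.default_rng(seed)
    rows = rng.normal(scale=0.1, size=(ratings.shape[0], dim))
    cols = rng.normal(scale=0.1, size=(ratings.shape[1], dim))
    for _ in range(epochs):
        rows = als_step(cols, ratings, mask, reg)
        cols = als_step(rows, ratings.T, mask.T, reg)
    return rows, cols


def objective(ratings, mask, rows, cols, reg):
    """Squared error on observed entries plus L2 regularization."""
    err = (ratings - rows @ cols.T)[mask]
    return float(err @ err + reg * ((rows ** 2).sum() + (cols ** 2).sum()))
```

Each half-step solves an independent small linear system per row (or column), which is what makes the method amenable to parallelization across many cores; how ALX distributes that work is described in the paper itself, not in this sketch.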
