EZLDA: Efficient and Scalable LDA on GPUs

Latent Dirichlet Allocation (LDA) is a statistical approach to topic modeling with a wide range of applications. However, few attempts have been made to accelerate LDA on GPUs, despite their exceptional compute and memory throughput. To this end, we introduce EZLDA, which achieves efficient and scalable LDA training on GPUs through three contributions. First, EZLDA introduces a three-branch sampling method that exploits the convergence heterogeneity of tokens to reduce redundant sampling work. Second, to enable sparsity-aware formats for both the document-topic matrix D and the word-topic matrix W on GPUs with fast sampling and updating, we introduce a hybrid format for W, together with a corresponding token partitioning over T and an inverted index design. Third, we design a hierarchical workload-balancing solution that addresses the severely skewed workload distribution on a single GPU and scales EZLDA across multiple GPUs. Taken together, these techniques allow EZLDA to outperform state-of-the-art systems while consuming less memory.
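The sparsity claims above rest on the structure of collapsed Gibbs sampling, in which the document-topic counts D and the word-topic counts W are the two matrices that every token sample reads and updates. The sketch below is a minimal, CPU-side Python illustration of that baseline with sparse (dict-based) counts; it is background for the paper's setting, not EZLDA's GPU implementation, and the function name `gibbs_pass`, the hyperparameters `alpha` and `beta`, and the dict-of-counts layout are illustrative assumptions.

```python
import numpy as np

def gibbs_pass(docs, n_topics, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with sparse D and W counts.

    docs: list of documents, each a list of integer word ids.
    Returns the per-document (D) and per-word (W) topic-count dicts.
    """
    rng = np.random.default_rng(seed)
    vocab = 1 + max(w for doc in docs for w in doc)
    # Sparse counts: one {topic: count} dict per document and per word.
    D = [dict() for _ in docs]          # document-topic counts
    W = [dict() for _ in range(vocab)]  # word-topic counts
    Wsum = np.zeros(n_topics)           # total tokens per topic
    z = []                              # topic assignment per token
    for d, doc in enumerate(docs):      # random initialization
        zs = rng.integers(0, n_topics, size=len(doc))
        z.append(zs)
        for w, k in zip(doc, zs):
            D[d][k] = D[d].get(k, 0) + 1
            W[w][k] = W[w].get(k, 0) + 1
            Wsum[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                for C in (D[d], W[w]):
                    C[k] -= 1
                    if C[k] == 0:
                        del C[k]
                Wsum[k] -= 1
                # Full conditional:
                #   p(k) ∝ (D[d][k] + alpha) * (W[w][k] + beta) / (Wsum[k] + V*beta)
                p = np.array([(D[d].get(t, 0) + alpha)
                              * (W[w].get(t, 0) + beta)
                              / (Wsum[t] + vocab * beta)
                              for t in range(n_topics)])
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                D[d][k] = D[d].get(k, 0) + 1
                W[w][k] = W[w].get(k, 0) + 1
                Wsum[k] += 1
    return D, W

if __name__ == "__main__":
    docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4, 4]]
    D, W = gibbs_pass(docs, n_topics=2)
    print("document-topic counts:", D)
```

The dense per-topic loop inside the sampler is exactly what sparsity-aware systems avoid: as training converges, most documents and words touch only a few topics, so D and W stay sparse, and samplers that exploit this (as EZLDA does with its hybrid format for W and its inverted index) pay a per-token cost proportional to the number of nonzero topics rather than the full topic count.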
