DPS: A DSM-based Parameter Server for Machine Learning

With the emergence of big models with billions of parameters, the parameter server has become a high-throughput distributed machine learning (ML) architecture for efficiently storing and updating model parameters during training. Existing parameter servers, such as Parameter Server and Petuum, do not address data management and lack high-level data abstractions. They also lack task scheduling, which leaves computing resources underutilized and can lead to load imbalance. Their programming interfaces are complicated, and they do not support data flow operations (e.g., map/reduce), which are very useful for data preprocessing. These drawbacks limit both the performance and the usability of such parameter servers. In this paper, we propose DPS, a parameter server for machine learning based on Distributed Shared Memory (DSM). DPS provides flexible consistency models, high-level data abstraction and management with support for data flow operations, a lightweight task scheduling system, and a user-friendly programming interface to address the problems of existing systems. Experimental results show that DPS reduces networking time by about 50% and achieves up to 1.9x the performance of Petuum, while algorithms implemented on DPS require less code than their Petuum counterparts.
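The abstract does not show DPS's programming interface, but the pull/compute/push cycle common to parameter servers, combined with a map-style preprocessing step of the kind the abstract attributes to DPS, is easy to illustrate. The sketch below is a minimal, single-process illustration under stated assumptions: MockParameterServer, preprocess, and worker_loop are invented names for this example and are not DPS APIs.

```python
# Hypothetical sketch of a parameter-server training loop (not the DPS API).
# A single-process mock "server" stands in for the distributed key-value
# parameter store; the worker shows a map-style preprocessing stage followed
# by the usual pull -> compute gradient -> push cycle.

import numpy as np


class MockParameterServer:
    """In-memory stand-in for a distributed parameter store."""

    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # A real system would fetch (a shard of) the model over the network.
        return self.weights.copy()

    def push(self, grad):
        # A real system would send the gradient update to the server shard.
        self.weights -= self.lr * grad


def preprocess(record):
    """Map-style preprocessing: parse a CSV line into (features, label)."""
    *xs, y = (float(v) for v in record.split(","))
    return np.array(xs), y


def worker_loop(server, raw_data, epochs=50):
    data = [preprocess(r) for r in raw_data]   # data-flow "map" stage
    for _ in range(epochs):
        w = server.pull()                      # fetch current parameters
        grad = np.zeros_like(w)
        for x, y in data:                      # least-squares gradient
            grad += (w @ x - y) * x
        server.push(grad / len(data))          # send averaged update


if __name__ == "__main__":
    raw = ["1.0,0.0,2.0", "0.0,1.0,1.0", "1.0,1.0,3.0"]
    server = MockParameterServer(dim=2)
    worker_loop(server, raw)
    print("learned weights:", server.weights)
```

In a distributed deployment the pull/push calls would go over the network to server shards, and the consistency model (e.g., bulk-synchronous versus bounded-staleness) would govern how stale the pulled parameters may be; the abstract states that DPS offers flexible consistency models but does not specify them, so none is modeled here.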

[1] Yaoliang Yu, et al. Petuum: A New Platform for Distributed Machine Learning on Big Data, 2015, IEEE Trans. Big Data.

[2] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[3] Reynold Xin, et al. GraphX: Graph Processing in a Distributed Dataflow Framework, 2014, OSDI.

[4] Jinpeng Huai, et al. Ring: Real-Time Emerging Anomaly Monitoring System Over Text Streams, 2019, IEEE Transactions on Big Data.

[5] Ameet Talwalkar, et al. MLlib: Machine Learning in Apache Spark, 2015, J. Mach. Learn. Res.

[6] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.

[7] Hairong Kuang, et al. The Hadoop Distributed File System, 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[8] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.

[9] Alexander J. Smola, et al. Scalable inference in latent variable models, 2012, WSDM '12.

[10] David P. Anderson, et al. SETI@home: an experiment in public-resource computing, 2002, CACM.

[11] Allan Porterfield, et al. The Tera computer system, 1990.

[12] Allan Porterfield, et al. Exploiting heterogeneous parallelism on a multithreaded multiprocessor, 1992, ICS '92.

[13] Jacob Nelson, et al. Latency-Tolerant Software Distributed Shared Memory, 2015, USENIX Annual Technical Conference.

[14] Scott Shenker, et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[15] Alexander J. Smola, et al. An architecture for parallel topic models, 2010, Proc. VLDB Endow.

[16] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[17] Carlos Guestrin, et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, 2012.

[18] Inderjit S. Dhillon, et al. Generalized Nonnegative Matrix Approximations with Bregman Divergences, 2005, NIPS.