Distributed machine learning load balancing strategy in cloud computing services

Mobile service computing is a new cloud computing model that provides cloud services to users of mobile intelligent terminals over the mobile internet. Quality of service is an essential problem for mobile service computing. In this paper, we study how to accelerate the training of distributed machine learning (ML) models on cloud services. Distributed ML has become the mainstream way to train today's ML models. In traditional distributed ML based on bulk synchronous parallel (BSP), frequent synchronization barriers mean that a temporary slowdown of any node in the cluster delays the computation of the other nodes, degrading overall performance. We propose a load balancing strategy named adaptive fast reassignment (AdaptFR) and, based on it, build a distributed parallel computing model called adaptive-dynamic synchronous parallel (A-DSP). A-DSP uses a more relaxed synchronization model to reduce the overhead of synchronization while preserving model consistency. A-DSP also implements the AdaptFR load balancing strategy, which addresses the straggler problem caused by performance differences between nodes without sacrificing model accuracy. Experiments show that A-DSP effectively improves training speed while maintaining model accuracy in distributed ML model training.
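
The straggler effect described in the abstract can be illustrated with a small sketch. This is not the AdaptFR algorithm, whose details the abstract does not give. It is a hypothetical toy model that assumes each worker has a known relative speed and that an iteration under a synchronous barrier finishes only when the slowest worker does. The function names `bsp_iteration_time` and `rebalance`, and the speed-proportional reassignment rule, are illustrative assumptions.

```python
from typing import List


def bsp_iteration_time(work: List[float], speed: List[float]) -> float:
    """Time of one barrier-synchronized iteration: the slowest worker decides."""
    return max(w / s for w, s in zip(work, speed))


def rebalance(total_work: float, speed: List[float]) -> List[float]:
    """Hypothetical rule: split the work in proportion to each worker's speed."""
    total_speed = sum(speed)
    return [total_work * s / total_speed for s in speed]


# Four workers; the last one is temporarily running at half speed (assumed values).
speeds = [1.0, 1.0, 1.0, 0.5]

# Even split: every worker gets 25 units, so the slow worker holds up the barrier.
even = [25.0] * 4
even_time = bsp_iteration_time(even, speeds)

# Speed-proportional split: all workers reach the barrier at about the same time.
balanced = rebalance(100.0, speeds)
balanced_time = bsp_iteration_time(balanced, speeds)

print(f"even split:     {even_time:.2f} time units")
print(f"balanced split: {balanced_time:.2f} time units")
```

In this toy setting the even split takes as long as the slowest worker needs for its full share, while the proportional split shortens the iteration because no worker waits at the barrier. A real strategy would also have to estimate speeds online and account for the cost of moving data, which this sketch ignores.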
