Paper information - Improving Spark application throughput via memory aware task co-location: a mixture of experts approach

Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By running multiple applications on a computing host, task co-location can significantly improve server utilization and system throughput. However, effective task co-location is non-trivial, as it requires an understanding of the computing resource requirements of the co-running applications in order to determine which tasks, and how many of them, can be co-located. State-of-the-art co-location schemes either require the user to supply the resource demands, which are often far beyond what is needed, or use a one-size-fits-all function to estimate the requirement, which is unlikely to capture the diverse behaviors of applications. In this paper, we present a mixture-of-experts approach to model the memory behavior of Spark applications. We achieve this by learning, off-line, a range of specialized memory models on a range of typical applications; we then determine at runtime which of the memory models, or experts, best describes the memory behavior of the target application. We show that by accurately estimating the resource level that is needed, a co-location scheme can effectively determine how many applications can be co-located on the same host to improve the system throughput, taking into consideration the memory and CPU requirements of co-running application tasks. Our technique is applied to a set of representative data analytic applications built upon the Apache Spark framework. We evaluated our approach for system throughput and average normalized turnaround time on a multi-core cluster. Our approach achieves over 83.9% of the performance delivered using an ideal memory predictor. We obtain, on average, an 8.69x improvement in system throughput and a 49% reduction in turnaround time over executing application tasks in isolation, which translates to a 1.28x and 1.68x improvement over a state-of-the-art co-location scheme for system throughput and turnaround time, respectively.
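The idea in the abstract can be illustrated with a minimal sketch. It assumes three hypothetical memory experts with made-up functional forms and coefficients, selects an expert by least squared error on a few profiling samples (an assumption; the abstract does not say how the expert is chosen at runtime), and then counts how many identical tasks fit within a host's free memory and cores. The names `EXPERTS`, `select_expert`, and `max_colocated` are illustrative and do not come from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical experts: each maps input size (GB) to a memory footprint (GB).
# The functional forms and coefficients are illustrative assumptions.
EXPERTS: Dict[str, Callable[[float], float]] = {
    "linear": lambda x: 0.5 + 1.2 * x,
    "logarithmic": lambda x: 1.0 + 2.0 * math.log1p(x),
    "saturating": lambda x: 6.0 * (1.0 - math.exp(-0.4 * x)),
}


def select_expert(samples: List[Tuple[float, float]]) -> str:
    """Return the name of the expert with the smallest squared error
    on (input_size_gb, observed_memory_gb) profiling samples.
    Assumed selection rule, not necessarily the paper's."""
    def squared_error(name: str) -> float:
        model = EXPERTS[name]
        return sum((model(x) - y) ** 2 for x, y in samples)

    return min(EXPERTS, key=squared_error)


def max_colocated(free_memory_gb: float, free_cores: int,
                  task_memory_gb: float, task_cores: int) -> int:
    """Number of identical tasks that fit within both the memory
    and the CPU budget of one host."""
    if task_memory_gb <= 0 or task_cores <= 0:
        raise ValueError("task demands must be positive")
    by_memory = int(free_memory_gb // task_memory_gb)
    by_cpu = free_cores // task_cores
    return min(by_memory, by_cpu)


# Profile the target application on small inputs, pick an expert,
# then predict the footprint for the full 8 GB input.
samples = [(1.0, 1.7), (2.0, 2.9), (4.0, 5.3)]
expert = select_expert(samples)
predicted = EXPERTS[expert](8.0)

# A host with 64 GB free memory and 16 free cores; each task needs 2 cores.
slots = max_colocated(64.0, 16, predicted, 2)
print(expert, round(predicted, 1), slots)
```

In this toy setup the memory budget, not the core count, limits co-location, which is why an accurate memory estimate matters more than a conservative user-supplied figure.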

Vicent Sanz Marco | Ben Taylor | Barry Porter | Z. Wang
