Auto-tuning Spark big data workloads on POWER8: Prediction-based dynamic SMT threading

Much research is devoted to tuning big data analytics in modern data centers, since at that scale even a small performance improvement translates directly into large cost savings. Simultaneous multithreading (SMT) has attracted great interest from the data center community because it can boost the performance of big data analytics by increasing processor resource utilization. For example, emerging processor architectures such as POWER8 support up to 8-way multithreading. However, because different big data workloads have disparate architectural characteristics, identifying the SMT configuration that delivers the best performance is challenging, given the complexity of both application behavior and processor architecture. In this paper, we focus on auto-tuning the SMT configuration for Spark-based big data workloads on POWER8, although our methodology could be generalized and extended to other software stacks and architectures. We propose a prediction-based dynamic SMT threading (PBDST) framework that uses machine learning algorithms to adjust the thread count in SMT cores on POWER8 processors. Its innovation lies in using online SMT configuration predictions, derived from microarchitecture-level profiling, to select thread counts that achieve nearly optimal performance. Moreover, it is implemented at the Spark software stack layer and is transparent to user applications. After evaluating a large set of machine learning algorithms, we choose the most efficient ones to perform online predictions. The experimental results demonstrate that our approach achieves up to 56.3% performance improvement and an average performance gain of 16.2% compared with the default configuration, which is the maximum SMT configuration (SMT8) on our system.
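To make the prediction step concrete, the following is a minimal sketch of how profiled counters could be mapped to an SMT level. It is not the PBDST implementation: the feature set (cycles per instruction, L1 data miss rate, branch misprediction rate), the training samples, the nearest-neighbor classifier, and the function names `predict_smt` and `smt_command` are all assumptions chosen for illustration. The abstract does not say which algorithms or counters were finally selected, nor how the thread count is applied.

```python
from math import dist

# Hypothetical offline training data: (counter features, best SMT level observed).
# Features are illustrative only: (cycles per instruction, L1D miss rate,
# branch mispredict rate). Real use would normalize features to a common scale.
TRAINING = [
    ((0.8, 0.02, 0.01), 8),
    ((0.9, 0.03, 0.02), 8),
    ((1.6, 0.10, 0.03), 4),
    ((1.8, 0.12, 0.02), 4),
    ((3.0, 0.25, 0.05), 2),
    ((3.4, 0.30, 0.06), 1),
]


def predict_smt(features, training=TRAINING, k=3):
    """Return the SMT level voted for by the k nearest training samples."""
    nearest = sorted(training, key=lambda s: dist(s[0], features))[:k]
    votes = {}
    for _, level in nearest:
        votes[level] = votes.get(level, 0) + 1
    # Ties go to the lower SMT level (an assumed, conservative policy):
    # max() returns the first maximal item, and levels are visited ascending.
    return max(sorted(votes), key=lambda lvl: votes[lvl])


def smt_command(level):
    """Build (but do not run) a command that would set the SMT mode.

    Assumes a Linux-on-POWER host with the powerpc-utils `ppc64_cpu` tool;
    how PBDST itself applies the setting is not described in the abstract.
    """
    if level not in (1, 2, 4, 8):
        raise ValueError("POWER8 SMT level must be 1, 2, 4, or 8")
    return ["ppc64_cpu", f"--smt={level}"]


if __name__ == "__main__":
    sample = (0.85, 0.02, 0.01)  # hypothetical counters from a profiling window
    level = predict_smt(sample)
    print(level, smt_command(level))
```

In an online setting, a loop of this kind would run once per profiling window, so the predictor has to be cheap to evaluate. That cost is one plausible reason for comparing many algorithms before choosing which ones to deploy.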
