Predicting the Runtime of Memory Intensive Batch Workloads

Data centers use their off-peak business hours to run batch jobs in order to maximize resource utilization. Large data centers may run thousands of batch jobs every day. These jobs typically process large volumes of data and can be either compute- or memory-intensive. Batch jobs may perform data reconciliation, risk analysis, and analytics that are critical to the business. Hence, it is imperative that these jobs complete within the available time frame. Modern servers with a large number of cores and a Non-Uniform Memory Access (NUMA) architecture provide very large compute capacity. This means that many batch jobs can be run concurrently to minimize their collective completion time. However, excessive parallelism may create memory access bottlenecks and adversely affect completion time. The objective of this work is to predict the completion time of concurrently running batch jobs. We assume each job is multithreaded and memory-intensive. A prediction model based on memory and CPU contention is proposed. Our predictions are based on a server simulation that uses individual batch job data to predict the completion time of concurrently running jobs. The efficacy of our approach is validated using STREAM, a well-known open-source synthetic benchmark. We also study the effect of hyper-threading and memory binding on the prediction accuracy of our model.
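To make the idea of contention-aware runtime prediction concrete, here is a minimal sketch (not the paper's actual model) of how concurrent completion times might be estimated when jobs share a finite memory bandwidth. The job parameters, bandwidth cap, and proportional-slowdown assumption are all illustrative simplifications introduced here, not taken from the paper.

```python
# Hypothetical sketch: predict completion times of concurrently running
# memory-intensive jobs, assuming each job is characterized in isolation
# by its standalone runtime and memory bandwidth demand, and that jobs
# slow down in proportion to the fraction of demanded bandwidth served.

def predict_completion_times(jobs, server_bw):
    """jobs: list of (standalone_runtime_s, bw_demand_GBps) tuples.
    server_bw: total memory bandwidth of the server (GB/s).
    Returns the predicted completion time (s) of each job."""
    remaining = [rt for rt, _ in jobs]   # remaining standalone work (s)
    done = [0.0] * len(jobs)
    active = set(range(len(jobs)))
    t = 0.0
    while active:
        # Fraction of demanded bandwidth each active job receives.
        demand = sum(jobs[i][1] for i in active)
        scale = min(1.0, server_bw / demand)
        # Advance time until the next job finishes at this progress rate.
        dt = min(remaining[i] / scale for i in active)
        for i in list(active):
            remaining[i] -= scale * dt
            if remaining[i] <= 1e-9:
                done[i] = t + dt
                active.discard(i)
        t += dt
    return done

# Two identical jobs that together demand twice the server bandwidth
# run at half speed, so each 10 s job finishes at 20 s.
print(predict_completion_times([(10.0, 10.0), (10.0, 10.0)], 10.0))
```

This toy model captures only the bandwidth-saturation effect; the paper's simulation-based approach additionally accounts for CPU contention, hyper-threading, and NUMA memory binding.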
