Analyzing and optimizing task granularity on the JVM

Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, resulting in missed parallelization opportunities. In this paper, we provide a better understanding of task granularity for applications running on a Java Virtual Machine. We present a novel profiler that measures the granularity of every executed task. Our profiler collects carefully selected metrics from the whole system stack with low overhead, and helps developers locate performance problems. We analyze task granularity in the DaCapo and ScalaBench benchmark suites, revealing several inefficiencies related to fine-grained and coarse-grained tasks. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in two benchmarks, achieving speedups of up to 1.53x.
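The fine- vs. coarse-grained trade-off described above can be illustrated with a standard fork/join computation (this sketch is not the paper's profiler; the `THRESHOLD` constant is a hypothetical tuning knob). A small sequential cutoff produces many fine-grained tasks and thus high scheduling overhead, while a very large cutoff produces few coarse-grained tasks that may leave cores idle:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: task granularity in a fork/join sum is controlled
// by the sequential cutoff THRESHOLD (a hypothetical tuning parameter).
class SumTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000; // granularity knob: elements per leaf task
    final long[] data;
    final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {      // task is coarse enough: compute sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;       // otherwise split into two subtasks
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                     // schedule left half asynchronously
        return right.compute() + left.join();
    }
}

public class Granularity {
    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        long sum = ForkJoinPool.commonPool()
                               .invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 1000000
    }
}
```

Lowering `THRESHOLD` toward 1 makes per-task work vanish relative to the cost of forking and joining; raising it toward the array length collapses the computation into a single sequential task. A granularity profiler helps find the useful range in between.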
