Bottle graphs: visualizing scalability bottlenecks in multi-threaded applications

Understanding and analyzing the performance and scalability of multi-threaded programs is far from trivial, which severely complicates parallel software development and optimization. In this paper, we present bottle graphs, a powerful analysis tool that visualizes multi-threaded program performance with regard to both per-thread parallelism and execution time. Each thread is represented as a box whose height equals that thread's share of the total program execution time, whose width equals its parallelism, and whose area equals its total running time. The boxes of all threads are stacked on top of each other, yielding a stack whose height equals the total program execution time. Bottle graphs show exactly how scalable each thread is, and thus guide optimization towards those threads that have a smaller parallel component (narrower boxes) and a larger share of the total execution time (taller boxes), i.e., towards the 'neck' of the bottle. Using light-weight OS modules, we calculate bottle graphs for unmodified multi-threaded programs running on real processors with an average overhead of 0.68%. To demonstrate their utility, we perform an extensive analysis of 12 Java benchmarks running on top of the Jikes JVM, which introduces many JVM service threads. We not only reveal and explain the scalability limitations of several well-known Java benchmarks; we also analyze why the garbage collector itself does not scale and in fact performs optimally with two collector threads for all benchmarks, regardless of the number of application threads. Finally, we compare the scalability of Jikes with that of the OpenJDK JVM. Together, these analyses demonstrate how useful and intuitive bottle graphs are as a tool for analyzing scalability and helping to optimize multi-threaded applications.
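The box dimensions described in the abstract can be illustrated with a small sketch. It is not the authors' OS-module implementation. It assumes a hypothetical input format, a list of `(duration, running_threads)` intervals, and one plausible attribution rule: each interval is divided equally among the threads running during it. The function name `bottle_graph` and the thread names are illustrative only. Under these assumptions, a thread's height is the sum of its shares, its area is its total running time, and its width (parallelism) is area divided by height.

```python
from collections import defaultdict


def bottle_graph(intervals):
    """Compute one box per thread from (duration, running_thread_ids) intervals.

    Assumption (not taken from the abstract): each interval's duration is
    split equally among the threads running in it. Intervals with no running
    thread are skipped, so this sketch leaves idle time unattributed.
    """
    share = defaultdict(float)    # height: share of total execution time
    running = defaultdict(float)  # area: total running time of the thread
    for duration, active in intervals:
        if not active:
            continue
        for tid in active:
            share[tid] += duration / len(active)
            running[tid] += duration
    # width (parallelism) = area / height
    return {
        tid: {
            "height": share[tid],
            "width": running[tid] / share[tid],
            "area": running[tid],
        }
        for tid in share
    }


# Hypothetical trace: 2 s of main alone, 4 s of two workers, 1 s of main plus GC.
trace = [
    (2.0, {"main"}),
    (4.0, {"w1", "w2"}),
    (1.0, {"main", "gc"}),
]
graph = bottle_graph(trace)
for tid, box in sorted(graph.items()):
    print(tid, box)
```

In this hypothetical trace, the heights of all boxes sum to the 7 seconds of total execution time, and `main` is both the tallest and the narrowest box, which places it at the neck of the bottle.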
