Understanding and overcoming parallelism bottlenecks in ForkJoin applications

The ForkJoin framework is a widely used parallel programming framework upon which both core concurrency libraries and real-world applications are built. Beneath its simple and user-friendly APIs, ForkJoin is a sophisticated managed parallel runtime unfamiliar to many application programmers: the framework core is a work-stealing scheduler that handles fine-grained tasks and sustains the pressure of automatic memory management. ForkJoin thus exposes a unique gap in the compute stack between high-level software engineering and low-level system optimization. Understanding and bridging this gap is crucial for the future of parallelism support in JVM-supported applications. This paper describes a comprehensive study of parallelism bottlenecks in ForkJoin applications, with a unique focus on how they interact with underlying system-level features, such as work stealing and memory management. We identify 6 bottlenecks and find that refactoring them can significantly improve performance and energy efficiency. Our field study includes an in-depth analysis of Akka, a real-world actor framework, and 30 additional open-source ForkJoin projects. We sent our patches to the developers of 15 projects, and 7 out of the 9 projects that replied to our patches have accepted them.
