An Empirical Study on the Use and Misuse of Java 8 Streams

Streaming APIs allow for big data processing of native data structures by providing MapReduce-like operations over these structures. However, unlike traditional big data systems, these data structures typically reside in shared memory accessed by multiple cores. Although popular, this emerging hybrid paradigm opens the door to possibly detrimental behavior, such as thread contention and bugs related to non-execution and non-determinism. This study explores the use and misuse of a popular streaming API, namely, Java 8 Streams. The focus is on how developers decide whether to run these operations sequentially or in parallel and on bugs both specific and tangential to this paradigm. Our study involved analyzing 34 Java projects and 5.53 million lines of code, along with 719 manually examined code patches. Both automated methodologies, including interprocedural static analysis, and manual methodologies were employed. The results indicate that streams are pervasive, parallelization is not widely used, and performance is a crosscutting concern that accounted for the majority of fixes. We also present coincidences that both confirm and contradict the results of related studies. The study advances our understanding of streams and benefits practitioners, programming language and API designers, tool developers, and educators alike.
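As an illustration of the concepts named above, the following is a minimal sketch and is not taken from the studied projects. The class name `StreamSketch` and all of its method names are hypothetical. It contrasts a sequential and a parallel pipeline over the same data, and shows one form of non-execution: a pipeline with no terminal operation never runs its intermediate operations.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration class; not code from the study.
public class StreamSketch {

    // Sequential pipeline: a map followed by a reduction (sum).
    static long sumOfSquaresSequential(int n) {
        return IntStream.rangeClosed(1, n)
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    // Same pipeline, explicitly run in parallel. The lambda is stateless
    // and the reduction is associative, so the result matches the
    // sequential version.
    static long sumOfSquaresParallel(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    // Non-execution: intermediate operations are lazy. Without a terminal
    // operation the peek action is never invoked, so the counter stays 0.
    static int countSeenWithoutTerminalOperation() {
        AtomicInteger seen = new AtomicInteger();
        IntStream.range(0, 10).peek(i -> seen.incrementAndGet()); // no terminal operation
        return seen.get();
    }

    // Collecting through a collector, rather than mutating shared state
    // from a lambda, keeps a parallel pipeline deterministic; encounter
    // order is preserved for this ordered source.
    static List<Integer> evensCollectedInParallel(int n) {
        return IntStream.range(0, n)
                .parallel()
                .filter(i -> i % 2 == 0)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquaresSequential(10));
        System.out.println(sumOfSquaresParallel(10));
        System.out.println(countSeenWithoutTerminalOperation());
        System.out.println(evensCollectedInParallel(10));
    }
}
```

Whether the parallel variant is actually faster depends on the data size, the cost of each operation, and the number of available cores; the sketch only demonstrates that the two variants are interchangeable in result when the pipeline is free of side effects.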
