Large-Scale Evaluation of the Efficiency of Runtime-Verification Tools in the Wild

Runtime verification (RV) is a field of study that suffers from a lack of dedicated benchmarks. Many published evaluations of RV tools rely on workloads that are not representative of real-world programs. In this paper, we present a methodology for automatically discovering relevant open-source projects for evaluating RV tools, based on analyzing the unit tests of a large number of projects hosted on GitHub. Our evaluation shows that analyzing a large number of open-source projects, rather than a handful of manually selected workloads, provides better insight into the behavior of three state-of-the-art RV tools (JavaMOP, MarQ, and Muffin) with respect to two metrics: memory utilization and runtime overhead. By monitoring the test executions of a large number of projects, we show that none of the evaluated RV tools wins on both metrics.
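To make the measurement concrete, the following is a minimal sketch, not the authors' actual harness, of how runtime overhead could be obtained for a single discovered project: run its test suite once without instrumentation and once with an RV tool attached as a Java agent, then report the ratio of the two wall-clock times. The Maven invocation, the agent jar path, and the class and method names are illustrative assumptions.

```java
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical probe: measures test-suite runtime with and without an RV agent.
public class OverheadProbe {

    // Runs `mvn test` in the given project directory and returns wall-clock time in ms.
    static long runTests(Path project, String mavenOpts) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("mvn", "-q", "test");
        pb.directory(project.toFile());
        pb.environment().put("MAVEN_OPTS", mavenOpts);
        pb.inheritIO();
        long start = System.nanoTime();
        Process p = pb.start();
        if (!p.waitFor(2, TimeUnit.HOURS) || p.exitValue() != 0) {
            throw new IllegalStateException("test run failed or timed out: " + project);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        Path project = Path.of(args[0]);   // a cloned GitHub project (assumption)
        String agentJar = args[1];         // e.g. an RV tool packaged as a -javaagent (assumption)

        long baselineMs = runTests(project, "");
        long monitoredMs = runTests(project, "-javaagent:" + agentJar);

        // Overhead is reported as the ratio of monitored to baseline test time.
        System.out.printf("baseline=%dms monitored=%dms overhead=%.2fx%n",
                baselineMs, monitoredMs, (double) monitoredMs / baselineMs);
    }
}
```

Memory utilization could be collected analogously, for example by having the monitored JVM log its peak heap usage during each test run; the same two-run comparison then yields a per-project memory overhead.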
