RUGRAT: Evaluating program analysis and testing tools and compilers with large generated random benchmark applications

Benchmarks are heavily used in different areas of computer science to evaluate algorithms and tools. In program analysis and testing, open-source and commercial programs are routinely used as benchmarks to evaluate different aspects of algorithms and tools. Unfortunately, many of these programs are written by programmers who introduce different biases, and it is very difficult to find programs that can serve as benchmarks with highly reproducible results. We propose a novel approach for generating random benchmarks for evaluating program analysis and testing tools and compilers. Our approach uses stochastic parse trees, in which the production rules of the language grammar are assigned probabilities that specify how frequently instantiations of these rules appear in the generated programs. We implemented our tool for Java and used it to generate a set of large benchmark programs of up to 5M lines of code each, with which we evaluated different program analysis and testing tools and compilers. The generated benchmarks let us independently rediscover several issues in the evaluated tools. Copyright © 2014 John Wiley & Sons, Ltd.
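
The idea of assigning probabilities to grammar production rules can be illustrated with a minimal Java sketch. It is not RUGRAT's implementation: the class `StochasticGrammar`, the toy grammar, the probability values, and the depth cutoff are all assumptions made for illustration. Each nonterminal has weighted alternatives, and one is drawn at random on every expansion. Beyond a depth limit the last-listed (non-recursive) alternative is forced so that generation terminates. A fixed seed makes the generated program reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical toy generator: grammar productions carry probabilities. */
final class StochasticGrammar {
    /** One alternative of a nonterminal: its weight and the symbols it expands to. */
    private static final class Production {
        final double probability;
        final List<String> symbols;
        Production(double probability, List<String> symbols) {
            this.probability = probability;
            this.symbols = symbols;
        }
    }

    private final Map<String, List<Production>> rules = new HashMap<>();
    private final Random random;
    private final int maxDepth;

    StochasticGrammar(long seed, int maxDepth) {
        this.random = new Random(seed); // fixed seed => reproducible output
        this.maxDepth = maxDepth;
    }

    /** Adds an alternative; the last one added per nonterminal must not recurse. */
    void addRule(String nonterminal, double probability, String... symbols) {
        rules.computeIfAbsent(nonterminal, k -> new ArrayList<>())
             .add(new Production(probability, Arrays.asList(symbols)));
    }

    /** Expands the start symbol into a space-separated token string. */
    String generate(String start) {
        StringBuilder out = new StringBuilder();
        expand(start, 0, out);
        return out.toString().trim();
    }

    private void expand(String symbol, int depth, StringBuilder out) {
        List<Production> alternatives = rules.get(symbol);
        if (alternatives == null) { // no rule: the symbol is a terminal token
            out.append(symbol).append(' ');
            return;
        }
        // At the depth limit, force the last (terminating) alternative.
        Production chosen = depth >= maxDepth
                ? alternatives.get(alternatives.size() - 1)
                : pick(alternatives);
        for (String s : chosen.symbols) {
            expand(s, depth + 1, out);
        }
    }

    /** Draws one alternative with likelihood proportional to its probability. */
    private Production pick(List<Production> alternatives) {
        double total = 0;
        for (Production p : alternatives) total += p.probability;
        double r = random.nextDouble() * total;
        for (Production p : alternatives) {
            r -= p.probability;
            if (r < 0) return p;
        }
        return alternatives.get(alternatives.size() - 1); // rounding fallback
    }

    /** Builds an assumed Java-like toy grammar; the weights are illustrative. */
    static StochasticGrammar demo(long seed, int maxDepth) {
        StochasticGrammar g = new StochasticGrammar(seed, maxDepth);
        g.addRule("stmt", 0.3, "if", "(", "expr", ")", "{", "stmt", "}");
        g.addRule("stmt", 0.7, "x", "=", "expr", ";");
        g.addRule("expr", 0.3, "expr", "+", "expr");
        g.addRule("expr", 0.3, "x");
        g.addRule("expr", 0.4, "1");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(demo(42L, 6).generate("stmt"));
    }
}
```

Raising the weight of the `if` alternative in this sketch would make conditionals more frequent in the output, which mirrors how rule probabilities control the makeup of generated programs.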
