Evaluating program analysis and testing tools with the RUGRAT random benchmark application generator

Benchmarks are heavily used across computer science to evaluate algorithms and tools. In program analysis and testing, open-source and commercial programs are routinely used as benchmarks to evaluate different aspects of algorithms and tools. Unfortunately, these programs are written by programmers who introduce their own biases, and it is very difficult to find programs that can serve as benchmarks with highly reproducible results. We propose a novel approach for generating random benchmarks for evaluating program analysis and testing tools. Our approach uses stochastic parse trees, in which the production rules of the language grammar are assigned probabilities that specify how frequently instantiations of these rules appear in the generated programs. We implemented our tool for Java and used it to generate benchmarks with which we evaluated different program analysis and testing tools. Our tool was also implemented by a major software company for C++ and used by a team of developers to generate benchmarks that enabled them to reproduce a bug in less than four hours.
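The idea of assigning probabilities to grammar production rules can be illustrated with a minimal sketch. This is not RUGRAT's implementation: the toy grammar, the probabilities, the depth limit, and the names `GRAMMAR`, `generate`, and `generate_program` are all assumptions made for illustration. The sketch expands a start symbol by sampling one production per nonterminal according to its weight, and uses a seeded generator so that the same seed always yields the same program.

```python
import random

# Hypothetical toy grammar (not RUGRAT's): each nonterminal maps to a list of
# (probability, expansion) pairs. Symbols that are not keys are terminals.
GRAMMAR = {
    "stmt": [
        (0.5, ["assign"]),
        (0.3, ["if", "(", "expr", ")", "{", "stmt", "}"]),
        (0.2, ["while", "(", "expr", ")", "{", "stmt", "}"]),
    ],
    "assign": [(1.0, ["x", "=", "expr", ";"])],
    "expr": [
        (0.6, ["term"]),
        (0.4, ["term", "+", "expr"]),
    ],
    "term": [(0.5, ["x"]), (0.5, ["1"])],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Assumed termination policy: past the depth limit, always take the
        # first rule, which is non-recursive for every nonterminal above.
        expansion = rules[0][1]
    else:
        # Sample one production according to its assigned probability.
        weights = [p for p, _ in rules]
        expansion = rng.choices(rules, weights=weights, k=1)[0][1]
    tokens = []
    for child in expansion:
        tokens.extend(generate(child, rng, depth + 1, max_depth))
    return tokens


def generate_program(seed, max_depth=6):
    """Generate one statement; the same seed reproduces the same output."""
    rng = random.Random(seed)
    return " ".join(generate("stmt", rng, 0, max_depth))


if __name__ == "__main__":
    for seed in range(3):
        print(seed, generate_program(seed))
```

Raising the probability of the `if` or `while` rules in this sketch would make nested control flow more frequent in the output, which is the kind of control over generated programs that the abstract attributes to rule probabilities.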
