A grammar for spreadsheet formulas evaluated on two large datasets

Spreadsheets are ubiquitous in the industrial world and often perform a role similar to that of other computer programs, which makes them interesting research targets. However, no reliable grammar exists that is concise enough to facilitate formula parsing and analysis and to support research on spreadsheet codebases. This paper presents a grammar for spreadsheet formulas that is compatible with the spreadsheet formula language, is compact enough to implement feasibly with a parser generator, and produces parse trees suited to further manipulation and analysis. We evaluate the grammar against more than one million unique formulas extracted from the well-known EUSES and Enron spreadsheet datasets, successfully parsing 99.99% of them. Additionally, we use the grammar to analyze these datasets and measure how frequently language features are used in spreadsheet formulas. Finally, we identify smelly constructs and uncommon cases in the syntax of formulas.
