Finding errors in the Enron spreadsheet corpus

Spreadsheet environments like MS Excel are the most widespread type of end-user software development tool, and spreadsheet-based applications can be found almost everywhere in organizations. Since spreadsheets are prone to error, several approaches have been proposed in the research literature to help users locate formula errors. However, the proposed methods were often designed based on assumptions about the nature of errors and were evaluated with mutations of correct spreadsheets. In this work we propose a method and tool to identify real-world formula errors within the Enron spreadsheet corpus. Our approach is based on heuristics that help us identify versions of the same spreadsheet, and our software helps the user identify spreadsheets that we assume contain error corrections. An initial manual inspection of a subset of such candidates led to the identification of more than two dozen formula errors. We publicly share the new collection of real-world spreadsheet errors.
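
The abstract does not spell out the heuristics, so the following Python sketch only illustrates the general idea under assumptions of our own: file names that differ only in version markers, dates, or counters are grouped as candidate versions of one spreadsheet, and formula cells that differ between two versions are flagged as candidate error corrections. The function names, the normalization rules, and the representation of a sheet as a mapping from cell address to formula string are all hypothetical and are not taken from the described tool.

```python
import re
from collections import defaultdict


def normalize_name(filename):
    """Map a file name to a key shared by likely versions (hypothetical rules)."""
    stem = filename.rsplit(".", 1)[0].lower()
    # Treat common separators as spaces so word boundaries work below.
    stem = re.sub(r"[_\-.()]+", " ", stem)
    # Drop version markers such as "v2" or "rev 3".
    stem = re.sub(r"\b(?:v|ver|rev)\s*\d+\b", " ", stem)
    # Drop words that typically distinguish copies of the same file.
    stem = re.sub(r"\b(?:final|copy|new|old)\b", " ", stem)
    # Drop remaining digits (dates, counters).
    stem = re.sub(r"\d+", " ", stem)
    return " ".join(stem.split())


def group_candidate_versions(filenames):
    """Group file names by normalized key; keep groups with several members."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize_name(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


def changed_formulas(old, new):
    """Return cells whose formula differs between two versions.

    `old` and `new` map cell addresses to formula strings (assumed format).
    Cells present in only one version are ignored in this sketch.
    """
    return {
        cell: (old[cell], new[cell])
        for cell in sorted(old.keys() & new.keys())
        if old[cell] != new[cell]
    }


if __name__ == "__main__":
    files = ["Budget_v2.xls", "budget_2001-03-01.xls", "forecast.xls"]
    print(group_candidate_versions(files))
    v1 = {"B10": "=SUM(B2:B8)", "C10": "=SUM(C2:C9)"}
    v2 = {"B10": "=SUM(B2:B9)", "C10": "=SUM(C2:C9)"}
    print(changed_formulas(v1, v2))
```

A changed formula is only a candidate: the change may be a refactoring or an extension rather than an error correction, which is why the abstract describes a manual inspection of the candidates.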
