Enron versus EUSES: A Comparison of Two Spreadsheet Corpora

Spreadsheets are widely used within companies and often form the basis for business decisions. Numerous cases are known in which incorrect information in spreadsheets has led to incorrect decisions. Such cases underline the relevance of research on the professional use of spreadsheets. Recently, a new dataset became available for research, containing over 15,000 business spreadsheets that were extracted from the Enron E-mail Archive. With this dataset, we 1) aim to obtain a thorough understanding of the characteristics of spreadsheets used within companies, and 2) compare the characteristics of the Enron spreadsheets with those of the EUSES corpus, the existing state-of-the-art set of spreadsheets that is frequently used in spreadsheet studies. Our analysis shows that 1) the majority of the Enron spreadsheets are not large in terms of worksheets and formulas, do not have a high degree of coupling, and contain relatively simple formulas; and 2) the spreadsheets from the EUSES corpus are, with respect to the measured characteristics, quite similar to the Enron spreadsheets.
