Enron's Spreadsheets and Related Emails: A Dataset and Analysis

Spreadsheets are used extensively in business processes around the world and are therefore a topic of research interest. Over the past few years, many spreadsheet studies have been performed on the EUSES spreadsheet corpus. While this corpus has served the spreadsheet community well, the spreadsheets it contains were mainly gathered with search engines and might therefore not represent spreadsheets used in companies. This paper presents an analysis of a new dataset, extracted from the Enron email archive, containing over 15,000 spreadsheets used within the Enron Corporation. In addition to the spreadsheets, we also present an analysis of the associated emails, in which we look into spreadsheet-specific email behavior. Our analysis shows that 1) 24% of Enron spreadsheets with at least one formula contain an Excel error, 2) there is little diversity in the functions used in spreadsheets: 76% of spreadsheets in the presented corpus use the same 15 functions, and 3) the spreadsheets are substantially more smelly than those in the EUSES corpus, especially in terms of long calculation chains. Regarding the emails, we observe that 1) spreadsheets are a frequent topic of email conversation, with 10% of emails either referring to or sending spreadsheets, and 2) the emails frequently discuss errors in and updates to spreadsheets.
