VEnron: A Versioned Spreadsheet Corpus and Related Evolution Analysis

Like most conventional software, spreadsheets are subject to software evolution. However, spreadsheet evolution is rarely supported by version management tools. As a result, version information across evolved spreadsheets is often missing or highly fragmented, which makes it difficult for users to notice issues that arise as their spreadsheets evolve. In this paper, we propose a semi-automated approach that leverages spreadsheets’ contexts (e.g., the emails to which they are attached) and contents to identify evolved spreadsheets and recover the embedded version information. We apply it to the released email archive of the Enron Corporation and build VEnron, an industrial-scale, versioned spreadsheet corpus. Our approach first clusters spreadsheets that likely evolved from one another into evolution groups, based on fragmented information such as spreadsheet filenames, spreadsheet contents, and the emails carrying the spreadsheets. It then recovers the version information of the spreadsheets in each evolution group. VEnron enables us to identify interesting issues that can arise from spreadsheet evolution. For example, versioned spreadsheets are prevalent in the Enron email archive; changes to formulas are common; and in some evolution groups (16.9%), new errors are introduced during evolution. To the best of our knowledge, VEnron is the first spreadsheet corpus with version information. It provides a valuable resource for understanding issues arising from spreadsheet evolution.
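
To make the filename-based part of the clustering step concrete, the following is a minimal sketch, not the procedure used to build VEnron. It assumes that spreadsheets whose filenames differ only in version-like tokens (numbers, dates, words such as "final" or "draft") are candidates for the same evolution group. The names `normalize_name`, `group_by_name`, `NOISE_WORDS`, and `VERSION_TOKEN` are hypothetical, and the content and email signals described above are deliberately left out.

```python
import os
import re
from collections import defaultdict

# Hypothetical list of words that often mark a revision rather than a topic
# (an assumption for illustration, not taken from the paper).
NOISE_WORDS = {"final", "draft", "revised", "rev", "new", "old", "copy",
               "version", "ver", "v"}

# Tokens such as "2", "0112", "v3", "ver2", "rev10" are treated as version markers.
VERSION_TOKEN = re.compile(r"^(?:v|ver|version|rev)?\d+$")


def normalize_name(filename):
    """Reduce a filename to a version-independent key."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split on anything that is not a lowercase letter or digit.
    tokens = re.split(r"[^a-z0-9]+", stem.lower())
    kept = [t for t in tokens
            if t and t not in NOISE_WORDS and not VERSION_TOKEN.match(t)]
    return " ".join(kept)


def group_by_name(filenames):
    """Group filenames that share a normalized key; singletons are dropped."""
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize_name(name)].append(name)
    return {key: sorted(members)
            for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    candidates = group_by_name([
        "Budget_v1.xls", "budget v2.xls", "Budget_final.xls",
        "Forecast 10-12.xls", "Forecast 10-15.xls", "Notes.xls",
    ])
    for key, members in candidates.items():
        print(key, members)
```

Grouping on filenames alone would over-merge unrelated spreadsheets that happen to share a generic name, which is one reason a real pipeline would also need the content and email evidence, plus manual checking in a semi-automated setting.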
