Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets

Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of Fuse with other corpora.

[1]  Danny Dig,et al.  Refactoring meets spreadsheet formulas , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[2]  RothermelGregg,et al.  The EUSES spreadsheet corpus , 2005 .

[3]  Zhe Chen,et al.  Automatic web spreadsheet data extraction , 2013, SS@ '13.

[4]  Alan F. Blackwell,et al.  First steps in programming: a rationale for attention investment models , 2002, Proceedings IEEE 2002 Symposia on Human Centric Computing Languages and Environments.

[5]  Emerson R. Murphy-Hill,et al.  Enron's Spreadsheets and Related Emails: A Dataset and Analysis , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[6]  Gregg Rothermel,et al.  The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms , 2005, ACM SIGSOFT Softw. Eng. Notes.

[7]  Bonnie A. Nardi,et al.  The spreadsheet interface: A basis for end user programming , 1990, IFIP TC13 International Conference on Human-Computer Interaction.

[8]  Margaret M. Burnett What Is End-User Software Engineering and Why Does It Matter? , 2009, IS-EUD.

[9]  Mary Shaw,et al.  The state of the art in end-user software engineering , 2011, ACM Comput. Surv..

[10]  Arie van Deursen,et al.  Detecting code smells in spreadsheet formulas , 2011, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[11]  Mary Shaw,et al.  Estimating the numbers of end users and end user programmers , 2005, 2005 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC'05).

[12]  Stephen G. Powell,et al.  A critical review of the literature on spreadsheet errors , 2008, Decis. Support Syst..