CheckCell: data debugging for spreadsheets

Testing and static analysis can help root out bugs in programs, but not in data. This paper introduces data debugging, an approach that combines program analysis and statistical analysis to automatically find potential data errors. Since it is impossible to know a priori whether data are erroneous, data debugging instead locates data that have a disproportionate impact on the computation. Such data are either very important or wrong. Data debugging is especially useful in data-intensive programming environments that intertwine data with programs in the form of queries or formulas. We present the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell identifies cells that have an unusually high impact on the spreadsheet's computations. We show that CheckCell is both analytically and empirically fast and effective. We show that it successfully finds injected typographical errors produced by a generative model trained on data entry from 169,112 Mechanical Turk tasks. CheckCell is more precise and efficient than standard outlier detection techniques. CheckCell also automatically identifies a key flaw in the infamous Reinhart and Rogoff spreadsheet.
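
The idea of locating inputs with a disproportionate impact on a computation can be illustrated with a small sketch. The code below is not CheckCell's algorithm; it swaps in a plain leave-one-out comparison as a simplified stand-in for the paper's statistical analysis. The function names, the `ratio` threshold, and the sample grades are all hypothetical and chosen only for illustration.

```python
from statistics import mean, median


def leave_one_out_impact(values, formula=mean):
    """Return, for each input, how far the formula's output moves when
    that single input is removed (a crude stand-in for 'impact')."""
    baseline = formula(values)
    return [
        abs(baseline - formula(values[:i] + values[i + 1:]))
        for i in range(len(values))
    ]


def flag_high_impact(values, formula=mean, ratio=3.0):
    """Return indices whose impact exceeds `ratio` times the median impact.

    `ratio` is an assumed, hand-picked threshold, not a value from the paper.
    """
    impacts = leave_one_out_impact(values, formula)
    typical = median(impacts)
    return [i for i, imp in enumerate(impacts) if imp > 0 and imp > ratio * typical]


# Hypothetical column of grades feeding an AVERAGE formula;
# index 4 holds 880, a plausible typo for 88.
grades = [82, 91, 78, 85, 880, 88, 79]
print(flag_high_impact(grades))  # -> [4]
```

A flagged cell is only a candidate: as the abstract notes, a high-impact value may be either wrong or genuinely important, so the final judgment stays with the user.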
