WARDER: Towards effective spreadsheet defect detection by validity-based cell cluster refinements

Abstract Nowadays spreadsheets are very popular and being widely used. However, they can be prone to various defects and cause severe consequences when end users poorly maintain them. Our research communities have proposed various techniques for automated detection of spreadsheet defects, but they commonly fall short of effectiveness, either due to their limited scope or relying on strict patterns. In this article, we discuss and improve one state-of-the-art technique, CUSTODES, which exploits spreadsheet cell clustering and defect detection to extend its scope and make its detection patterns adaptive to varying spreadsheet styles. Still, CUSTODES can be prone to problematic clustering when accidentally involving irrelevant cells, leading to a largely reduced detection precision. Regarding this, we present WARDER to refine CUSTODES’s spreadsheet cell clustering based on three extensible validity-based properties. Experimental results show that WARDER could improve the precision by 19.1% on spreadsheet cell clustering, which contributed to a precision improvement of 23.3 ~ 24.3% for spreadsheet defect detection, as compared to CUSTODES (F-measure increased from 0.71 to 0.79 ~ 0.82). WARDER also exhibited satisfactory results on another practical large-scale spreadsheet corpus VEnron2, improving the defect detection precision by 10.7 ~ 21.2% over CUSTODES.

[1]  Claire Le Goues,et al.  Automatically finding patches using genetic programming , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[2]  Jorma Sajaniemi Modeling Spreadsheet Audit: A Rigorous Approach to Automatic Visualization , 2000, J. Vis. Lang. Comput..

[3]  Martin Erwig,et al.  Header and Unit Inference for Spreadsheets Through Spatial Analyses , 2004, 2004 IEEE Symposium on Visual Languages - Human Centric Computing.

[4]  Raymond R. Panko,et al.  Revising the Panko-Halverson taxonomy of spreadsheet errors , 2008, Decis. Support Syst..

[5]  Emery D. Berger,et al.  ExceLint: automatically finding spreadsheet formula errors , 2018, Proc. ACM Program. Lang..

[6]  Jun Wei,et al.  Is spreadsheet ambiguity harmful? detecting and repairing spreadsheet smells due to ambiguous computation , 2014, ICSE.

[7]  Arie van Deursen,et al.  Data clone detection and visualization in spreadsheets , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[8]  Kenneth S. Rogoff,et al.  Growth in a Time of Debt , 2010 .

[9]  Franz Wotawa,et al.  On the Refinement of Spreadsheet Smells by means of Structure Information , 2019, J. Syst. Softw..

[10]  Roland Mittermeir,et al.  Finding high-level structures in spreadsheet programs , 2002, Ninth Working Conference on Reverse Engineering, 2002. Proceedings..

[11]  Hugo Ribeiro,et al.  Towards a Catalog of Spreadsheet Smells , 2012, ICCSA.

[12]  Martin Fowler,et al.  Refactoring - Improving the Design of Existing Code , 1999, Addison Wesley object technology series.

[13]  Stephen G. Powell,et al.  A comparison of spreadsheet users with different levels of experience , 2009 .

[14]  Stephen G. Powell,et al.  A critical review of the literature on spreadsheet errors , 2008, Decis. Support Syst..

[15]  Rui Abreu,et al.  On the Empirical Evaluation of Fault Localization Techniques for Spreadsheets , 2013, FASE.

[16]  Jácome Cunha,et al.  FaultySheet Detective: When Smells Meet Fault Localization , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[17]  Jun Wei,et al.  Detecting table clones and smells in spreadsheets , 2016, SIGSOFT FSE.

[18]  Gregor Engels,et al.  Systematic evolution of model-based spreadsheet applications , 2012, J. Vis. Lang. Comput..

[19]  Luc De Raedt,et al.  Learning constraints in spreadsheets and tabular data , 2017, Machine Learning.

[20]  Danny Dig,et al.  Refactoring meets spreadsheet formulas , 2012, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[21]  Chang Xu,et al.  CACheck: Detecting and Repairing Cell Arrays in Spreadsheets , 2017, IEEE Transactions on Software Engineering.

[22]  Bo Yang,et al.  Detecting faulty empty cells in spreadsheets , 2018, 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[23]  Glencora Borradaile,et al.  Planted-model evaluation of algorithms for identifying differences between spreadsheets , 2012, 2012 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC).

[24]  Kenneth N. Berk,et al.  Data Analysis With Microsoft Excel , 2000 .

[25]  Jie Wang,et al.  SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering , 2017, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).

[26]  Dennis F. Galletta,et al.  An empirical study of spreadsheet error-finding performance , 1993 .

[27]  Jie Zhang,et al.  Automated refactoring of nested-IF formulae in spreadsheets , 2018, ESEC/SIGSOFT FSE.

[28]  Jácome Cunha,et al.  SmellSheet detective: A tool for detecting bad smells in spreadsheets , 2012, 2012 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC).

[29]  Rui Abreu,et al.  On the empirical evaluation of similarity coefficients for spreadsheets fault localization , 2014, Automated Software Engineering.

[30]  Wanjun Chen,et al.  CUSTODES: Automatic Spreadsheet Cell Clustering and Smell Detection Using Strong and Weak Features , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[31]  Jian Lu,et al.  How effectively can spreadsheet anomalies be detected: An empirical study , 2017, J. Syst. Softw..

[32]  Andrea Zisman,et al.  Inconsistency Management in Software Engineering: Survey and Open Research Issues , 2000 .

[33]  Martin Erwig,et al.  UCheck: A spreadsheet type checker for end users , 2007, J. Vis. Lang. Comput..

[34]  Alexey O. Shigarov,et al.  Rule-based spreadsheet data transformation from arbitrary to relational tables , 2017, Inf. Syst..

[35]  Arie van Deursen,et al.  Detecting and refactoring code smells in spreadsheet formulas , 2013, Empirical Software Engineering.

[36]  Victor R. Basili,et al.  A Validation of Object-Oriented Design Metrics as Quality Indicators , 1996, IEEE Trans. Software Eng..

[37]  Arie van Deursen,et al.  Detecting and visualizing inter-worksheet smells in spreadsheets , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[38]  Benjamin Livshits,et al.  Melford: Using Neural Networks to Find Spreadsheet Errors , 2017 .

[39]  Martin Erwig,et al.  Automatic detection of dimension errors in spreadsheets , 2009, J. Vis. Lang. Comput..

[40]  Jácome Cunha,et al.  Model-Based Programming Environments for Spreadsheets , 2012, SBLP.

[41]  Xiaoxing Ma,et al.  WARDER: Refining Cell Clustering for Effective Spreadsheet Defect Detection via Validity Properties , 2019, 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS).

[42]  Sumit Gulwani,et al.  Spreadsheet table transformations from examples , 2011, PLDI '11.

[43]  Arie van Deursen,et al.  Detecting code smells in spreadsheet formulas , 2011, 2012 28th IEEE International Conference on Software Maintenance (ICSM).

[44]  Sumit Gulwani,et al.  Synthesizing Number Transformations from Input-Output Examples , 2012, CAV.

[45]  Gregg Rothermel,et al.  The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms , 2005, ACM SIGSOFT Softw. Eng. Notes.

[46]  Jian Lu,et al.  Generic Adaptive Scheduling for Efficient Context Inconsistency Detection , 2021, IEEE Transactions on Software Engineering.

[47]  Emerson R. Murphy-Hill,et al.  Enron's Spreadsheets and Related Emails: A Dataset and Analysis , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[48]  Jácome Cunha,et al.  Smelling Faults in Spreadsheets , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[49]  Dawson R. Engler,et al.  Bugs as deviant behavior: a general approach to inferring errors in systems code , 2001, SOSP.

[50]  Sumit Gulwani,et al.  Automating string processing in spreadsheets using input-output examples , 2011, POPL '11.