WARDER: Refining Cell Clustering for Effective Spreadsheet Defect Detection via Validity Properties

Spreadsheets are widely used, but they are prone to defects that can have severe consequences, largely because end users maintain them poorly. Existing spreadsheet defect detection techniques fall short in effectiveness, either because their scope is limited or because they rely on rigid patterns. In this paper, we discuss and improve one state-of-the-art technique, CUSTODES, which uses cell clustering and anomaly detection to extend its scope and make its patterns adaptive to varying spreadsheet styles. However, its clustering is fragile when irrelevant cells are involved, which greatly reduces its detection precision. We present WARDER, which refines CUSTODES's cell clustering based on validity properties. Experimental results show that WARDER improves cell clustering precision by 20.7% on average, or reaches 100% precision for 79.8% of worksheets, which contributes to a precision improvement of 23.1% for defect detection. WARDER also achieves satisfactory results in comparisons with other spreadsheet defect detection techniques and on another large-scale spreadsheet corpus, VEnron2.
