Cell Classification for Layout Recognition in Spreadsheets

Spreadsheets make up a large and valuable collection of documents in enterprise settings and on the Web. Although spreadsheets are intuitive to use and offer powerful functionality, extracting and reusing their data remains a cumbersome and mostly manual task. Their greatest strength, the freedom they give the user, is also their greatest weakness, since data can be structured arbitrarily. In this paper we therefore propose a supervised learning approach for layout recognition in spreadsheets. We work at the cell level, aiming to predict each cell's layout role out of five predefined alternatives. For this task we consider a large number of features not covered by related work. Moreover, we gather a large dataset of annotated cells from spreadsheets that vary in format and content. Our experiments with five different classification algorithms show that cell layout roles can be predicted with high accuracy. We then turn to revising the classification results in order to repair misclassifications. We propose a three-step approach that corrects a reasonable number of inaccurate predictions.
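
To make the cell-level formulation concrete, the following is a minimal sketch of supervised cell classification, not the method evaluated in this paper. The `Cell` fields, the five features, and the role names (`header`, `data`, `derived`) are hypothetical placeholders chosen for illustration, and a 1-nearest-neighbour rule stands in for whichever classification algorithm is actually used.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    row: int
    col: int
    value: str
    is_bold: bool = False     # hypothetical style flag
    is_formula: bool = False  # hypothetical: the cell holds a formula


def cell_features(cell):
    """Map one cell to a numeric feature vector (illustrative features only)."""
    text = cell.value.strip()
    # Crude numeric test: allow one decimal point and one minus sign.
    is_numeric = text.replace(".", "", 1).replace("-", "", 1).isdigit()
    return [
        float(is_numeric),
        float(cell.is_bold),
        float(cell.is_formula),
        float(cell.row == 0),    # cell sits in the first row
        float(len(text) > 20),   # long text
    ]


def predict_role(train, cell):
    """Return the label of the closest annotated cell (1-nearest neighbour)."""
    x = cell_features(cell)

    def squared_distance(item):
        labelled_cell, _label = item
        return sum((a - b) ** 2 for a, b in zip(cell_features(labelled_cell), x))

    return min(train, key=squared_distance)[1]


# Tiny annotated set; the labels are assumed names, not the paper's role set.
train = [
    (Cell(0, 0, "Year", is_bold=True), "header"),
    (Cell(1, 0, "2015"), "data"),
    (Cell(5, 1, "=SUM(B2:B5)", is_formula=True), "derived"),
]

role = predict_role(train, Cell(0, 1, "Revenue", is_bold=True))
```

In this sketch each cell is classified independently of the others; a revision step such as the one the abstract mentions would then operate on the resulting grid of predicted roles.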
