Table Identification and Reconstruction in Spreadsheets

Spreadsheets are among the most successful content generation tools, used in almost every enterprise for data transformation, visualization, and analysis. The high degree of freedom these tools provide results in very complex sheets that intermingle the actual data with formatting, formulas, layout artifacts, and textual metadata. To unlock the wealth of data contained in spreadsheets, a human analyst often has to understand and transform the data manually. To avoid this cumbersome process, we propose a framework that automatically infers the structure of these documents and extracts their data in a canonical form. In this paper, we describe our heuristics-based method for discovering tables in spreadsheets, assuming that each cell has already been classified as header, attribute, metadata, data, or derived. Experimental results on a real-world dataset of 439 worksheets (858 tables) show that our approach is feasible and effectively identifies tables within partially structured spreadsheets.
