Table Recognition in Spreadsheets via a Graph Representation

Spreadsheet applications are very popular data management tools. Their ease of use and abundant functionality equip novices and professionals alike with the means to generate, transform, analyze, and visualize data. As a result, spreadsheets are a rich source of factual and structured information, which accentuates the need to understand and extract their contents automatically. In this paper, we present a novel approach for recognizing tables in spreadsheets. After inferring the layout role of the individual cells, we group them into layout regions and encode the spatial interrelations between these regions in a graph representation. Based on this graph, we propose Remove and Conquer (RAC), an algorithm for table recognition that implements a list of carefully curated rules. An extensive experimental evaluation shows that our approach is viable, achieving high accuracy on a dataset of real spreadsheets from various domains.
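To make the graph representation concrete, the following is a minimal sketch of how layout regions and their spatial interrelations might be encoded. It is not the RAC algorithm, and it does not reproduce the paper's rules: the `Region` class, the role labels, the `relation` and `build_graph` helpers, and the rule that two regions are related when their row or column spans overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    """A rectangular block of cells sharing one layout role (inclusive bounds)."""
    role: str    # illustrative labels such as "header" or "data"
    top: int
    left: int
    bottom: int
    right: int


def spans_overlap(a_start, a_end, b_start, b_end):
    """True if the two inclusive integer intervals share at least one index."""
    return a_start <= b_end and b_start <= a_end


def relation(a, b):
    """Direction of region b relative to region a, or None if unrelated.

    Assumption: regions are related vertically when their column spans
    overlap, and horizontally when their row spans overlap.
    """
    if spans_overlap(a.left, a.right, b.left, b.right):
        if b.top > a.bottom:
            return "below"
        if b.bottom < a.top:
            return "above"
    if spans_overlap(a.top, a.bottom, b.top, b.bottom):
        if b.left > a.right:
            return "right"
        if b.right < a.left:
            return "left"
    return None


def build_graph(regions):
    """Map each related pair of region indices (i, j) to the direction of j from i."""
    edges = {}
    for i, j in combinations(range(len(regions)), 2):
        rel = relation(regions[i], regions[j])
        if rel is not None:
            edges[(i, j)] = rel
    return edges


# A header row above a data block, plus a separate header further right.
regions = [
    Region("header", top=0, left=0, bottom=0, right=3),
    Region("data", top=1, left=0, bottom=5, right=3),
    Region("header", top=0, left=6, bottom=0, right=8),
]
graph = build_graph(regions)
print(graph)
```

In this toy sheet, the first header is linked to the data block beneath it and to the second header on its right, while the data block and the second header share no row or column span and stay unconnected. A rule-based recognizer could then traverse such edges to decide which regions belong to the same table.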
