A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets

Spreadsheets are highly successful content generation tools, used in almost every enterprise to create a wealth of information. However, this information is often intermingled with formatting, layout, and textual metadata, which makes it hard to identify and interpret the tabular payload. Previous work has addressed this problem mainly with heuristics. Although fast to implement, these approaches fail to capture the high variability of user-generated spreadsheet tables. In this paper, we therefore propose a supervised approach that can adapt to arbitrary spreadsheet datasets. We represent the contents of a sheet with a graph model that carries layout and spatial features. We then apply genetic-based approaches to graph partitioning in order to recognize the parts of the graph that correspond to tables in the sheet. The search for tables is guided by an objective function, which is tuned to match the specific characteristics of a given dataset. We demonstrate the feasibility of this approach with an experimental evaluation on a large, real-world spreadsheet corpus.
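
To make the idea of a genetic search over graph partitions more concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each node is a region of cells, that each edge carries a hypothetical weight in [0, 1] estimating how likely its two regions are to belong to the same table, and that an individual is a bit vector saying which edges are kept. The names `components`, `cost`, and `genetic_partition`, the `penalty` parameter, and the objective itself are illustrative assumptions; the abstract does not specify the actual graph features or objective function.

```python
import random


def components(n_nodes, edges, keep):
    """Group nodes into connected components using only the kept edges."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), kept in zip(edges, keep):
        if kept:
            parent[find(u)] = find(v)
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def cost(keep, weights, penalty=1.0):
    """Hypothetical objective (lower is better): cutting an edge costs its
    weight, keeping an edge costs penalty * (1 - weight)."""
    return sum(penalty * (1.0 - w) if kept else w
               for w, kept in zip(weights, keep))


def genetic_partition(n_nodes, edges, weights, penalty=1.0, pop_size=20,
                      generations=40, mutation_rate=0.1, seed=0):
    """Toy genetic algorithm: truncation selection with elitism, one-point
    crossover, and bit-flip mutation over edge masks."""
    rng = random.Random(seed)
    n = len(edges)

    def score(individual):
        return cost(individual, weights, penalty)

    # Start from the "whole sheet is one table" mask plus random masks.
    population = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        elite = population[: pop_size // 2]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randint(0, n)
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children
    best = min(population, key=score)
    return components(n_nodes, edges, best), score(best)


if __name__ == "__main__":
    # Five regions in a chain; the weak edge (2, 3) is a likely table boundary.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    weights = [0.9, 0.8, 0.1, 0.9]
    print(genetic_partition(5, edges, weights))
```

In this toy objective each edge contributes independently, so the optimum could be read off directly. A genetic search becomes worthwhile when the objective scores whole candidate tables, for example with terms that depend on the shape or content of each resulting component, and when its weights are tuned on annotated sheets from the target dataset.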
