XLIndy: Interactive Recognition and Information Extraction in Spreadsheets

Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, because spreadsheets are only partially structured and carry implicit (visual and textual) information, they are hard to analyze automatically, which creates a bottleneck for information extraction. We therefore present XLIndy, a Microsoft Excel add-in with a machine learning back-end written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs, which enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables and save them as annotations. These annotations are used to measure performance and to (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing in several standard formats, such as CSV and JSON.
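As a rough illustration of the final extraction step, the sketch below exports a recognized table region to CSV and JSON using only the Python standard library. It is not XLIndy's actual code or API: the function names (`extract_table`, `table_to_csv`, `table_to_json`), the list-of-lists sheet representation, and the assumption that the first row of the region is the header are all hypothetical choices made for this example.

```python
import csv
import io
import json


def extract_table(grid, top, left, bottom, right):
    """Return the rows of a rectangular region of a sheet.

    Hypothetical helper: `grid` is a list of rows, and the bounds are
    0-based and inclusive, standing in for a recognized table range.
    """
    return [row[left:right + 1] for row in grid[top:bottom + 1]]


def table_to_csv(rows):
    """Serialize the table rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def table_to_json(rows):
    """Serialize the table as a JSON list of records.

    Assumes the first row of the region holds the column headers.
    """
    header, *body = rows
    return json.dumps([dict(zip(header, record)) for record in body])


# A toy sheet: a title line, an empty line, then a 3x3 table.
sheet = [
    ["Sales report", None, None],
    [None, None, None],
    ["Region", "Q1", "Q2"],
    ["North", 10, 12],
    ["South", 7, 9],
]

# The bounds would come from table recognition; here they are hard-coded.
table = extract_table(sheet, top=2, left=0, bottom=4, right=2)
csv_text = table_to_csv(table)
json_text = table_to_json(table)
```

In this sketch the table bounds are passed in by hand; in a real pipeline they would be the output of the table recognition step, possibly after manual revision by the user.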
