Tabular Cell Classification Using Pre-Trained Cell Embeddings

A large amount of data on the web is in tabular form, such as Excel spreadsheets, CSV files, and web tables. Tabular data is often meant for human consumption and uses data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (e.g., font size, border type, background color) to classify cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a method to embed the semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We then propose an RNN-based classification technique that combines these cell vector representations with the stylistic features introduced in previous work in order to improve the performance of cell type classification in complex documents. We evaluate our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that the proposed cell vector representations, in combination with our RNN-based classification technique, significantly improve cell type classification performance.
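The general idea of feeding a cell embedding together with stylistic features into a recurrent classifier can be sketched as follows. This is only an illustrative sketch, not the system described in the paper: the abstract does not specify the recurrent architecture, the feature set, or any dimensions, so the plain tanh recurrence, the three stylistic features, the class list, the random untrained weights, and all names (`style_features`, `cell_input`, `RowRNNClassifier`) are assumptions made for the example.

```python
import math
import random

# Cell roles named in the abstract; the full label set is an assumption.
CELL_TYPES = ["top_attribute", "data", "metadata"]


def style_features(cell):
    """Hypothetical stylistic features: bold flag, border flag, scaled font size."""
    return [float(cell["bold"]), float(cell["border"]), cell["font_size"] / 20.0]


def cell_input(cell):
    """Concatenate the (assumed pre-computed) cell embedding with style features."""
    return list(cell["embedding"]) + style_features(cell)


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class RowRNNClassifier:
    """Plain tanh RNN over the cells of one row (weights are random, untrained)."""

    def __init__(self, input_dim, hidden_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.w_in = rand_matrix(hidden_dim, input_dim, rng)
        self.w_h = rand_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = rand_matrix(n_classes, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def predict_row(self, cells):
        """Return one probability distribution over CELL_TYPES per cell."""
        hidden = [0.0] * self.hidden_dim
        outputs = []
        for cell in cells:
            x = cell_input(cell)
            pre = [a + b for a, b in zip(matvec(self.w_in, x), matvec(self.w_h, hidden))]
            hidden = [math.tanh(p) for p in pre]
            outputs.append(softmax(matvec(self.w_out, hidden)))
        return outputs


# Toy row of three cells with made-up 4-dimensional embeddings.
row = [
    {"embedding": [0.1, 0.3, -0.2, 0.5], "bold": True, "border": True, "font_size": 14},
    {"embedding": [0.0, -0.1, 0.4, 0.2], "bold": False, "border": False, "font_size": 11},
    {"embedding": [0.2, 0.2, 0.1, -0.3], "bold": False, "border": False, "font_size": 11},
]
model = RowRNNClassifier(input_dim=4 + 3, hidden_dim=8, n_classes=len(CELL_TYPES))
probs = model.predict_row(row)
```

Because the weights are untrained, `probs` holds arbitrary distributions; the sketch only shows how the two feature sources might be concatenated and passed through a recurrence so that each cell's prediction can depend on the cells before it in the row.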
