FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining

Tables store rich numerical data, but numerical reasoning over tables remains a challenge. In this paper, we find that the spreadsheet formula, which performs calculations on numerical values in tables, is naturally a strong supervision signal for numerical reasoning. More importantly, large numbers of spreadsheets with expert-made formulae are available on the web and can be obtained easily. FORTAP is the first method for numerical-reasoning-aware table pretraining that leverages a large corpus of spreadsheet formulae. We design two formula pretraining tasks to explicitly guide FORTAP to learn numerical reference and calculation in semi-structured tables. FORTAP achieves state-of-the-art results on two representative downstream tasks, cell type classification and formula prediction, showing the great potential of numerical-reasoning-aware pretraining.
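To illustrate why a formula can act as supervision, the sketch below takes a formula string and recovers which cells it references and which functions or operators it applies. This is only a simplified illustration, not FORTAP's actual pretraining pipeline: the function name `formula_supervision`, the regular expressions, and the output format are all assumptions made for this example, and only A1-style references on a single sheet are handled.

```python
import re

# Hypothetical patterns for A1-style references (illustrative, not from the paper).
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?([0-9]+)")
RANGE = re.compile(r"(\$?[A-Z]{1,3}\$?[0-9]+):(\$?[A-Z]{1,3}\$?[0-9]+)")
FUNC = re.compile(r"([A-Z][A-Z0-9.]*)\(")


def col_to_index(col):
    """Convert a column label such as 'B' or 'AA' to a 1-based index."""
    idx = 0
    for ch in col:
        idx = idx * 26 + (ord(ch) - ord("A") + 1)
    return idx


def index_to_col(idx):
    """Convert a 1-based column index back to its label."""
    letters = ""
    while idx > 0:
        idx, rem = divmod(idx - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def expand_range(start, end):
    """List every cell in a rectangular range, column by column."""
    c1, r1 = CELL_REF.fullmatch(start).groups()
    c2, r2 = CELL_REF.fullmatch(end).groups()
    cols = range(col_to_index(c1), col_to_index(c2) + 1)
    rows = range(int(r1), int(r2) + 1)
    return [f"{index_to_col(c)}{r}" for c in cols for r in rows]


def formula_supervision(formula):
    """Extract referenced cells and functions/operators from a formula string."""
    body = formula.lstrip("=")
    functions = FUNC.findall(body)
    operators = re.findall(r"[+\-*/]", body)

    # Drop function names first so that e.g. LOG10( is not read as cell LOG10.
    stripped = FUNC.sub("(", body)

    referenced = []

    def _expand(match):
        referenced.extend(expand_range(match.group(1), match.group(2)))
        return " "

    # Expand ranges, then collect the remaining single-cell references.
    rest = RANGE.sub(_expand, stripped)
    for match in CELL_REF.finditer(rest):
        referenced.append(match.group(1) + match.group(2))

    return {"referenced_cells": referenced, "operators": functions + operators}


if __name__ == "__main__":
    print(formula_supervision("=SUM(B2:B5)/C2"))
```

For a formula cell containing `=SUM(B2:B5)/C2`, the extracted cells are candidate targets for learning numerical reference, and the extracted functions and operators are candidate targets for learning calculation. A real system would need a proper formula grammar to handle cross-sheet references, named ranges, and string literals.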
