SpreadsheetCoder: Formula Prediction from Semi-structured Context

Spreadsheet formula prediction is an important program synthesis problem with many real-world applications. Previous work typically uses input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First, spreadsheet data entries are organized as tables, so rows and columns are not necessarily independent of each other. Second, many spreadsheet tables include headers, which provide high-level descriptions of the cell data, yet previous synthesis approaches do not consider headers as part of the specification. In this work, we present the first approach for synthesizing spreadsheet formulas from tabular context, which includes both headers and semi-structured tabular data. In particular, we propose SpreadsheetCoder, a BERT-based model architecture that represents the tabular context in both row-based and column-based formats. We train our model on a large dataset of spreadsheets and demonstrate that SpreadsheetCoder achieves a top-1 prediction accuracy of 42.51%, a considerable improvement over baselines that do not employ rich tabular context. Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
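To make the idea of "tabular context in both row-based and column-based formats" concrete, the following is a minimal sketch of how the cells around a target cell might be gathered into a row-wise view and a column-wise view, together with the headers. It is an illustration only, not the paper's actual encoding: the function name `tabular_context`, the `window` parameter, the `[TARGET]` placeholder, and the assumption that row 0 holds the headers are all hypothetical.

```python
from typing import List, Tuple

Table = List[List[str]]


def tabular_context(
    table: Table, target: Tuple[int, int], window: int = 2
) -> Tuple[List[str], List[List[str]], List[List[str]]]:
    """Return (headers, row_view, column_view) around the target cell.

    Assumptions (hypothetical, for illustration): the table is rectangular,
    row 0 holds the headers, and the target cell is masked with "[TARGET]".
    """
    r, c = target
    n_rows, n_cols = len(table), len(table[0])

    # Clip the window to the table; data rows start at index 1.
    lo_r, hi_r = max(1, r - window), min(n_rows, r + window + 1)
    lo_c, hi_c = max(0, c - window), min(n_cols, c + window + 1)

    def cell(i: int, j: int) -> str:
        # Hide the cell whose formula is to be predicted.
        return "[TARGET]" if (i, j) == (r, c) else table[i][j]

    # Row-based view: one list per neighbouring row.
    row_view = [[cell(i, j) for j in range(lo_c, hi_c)] for i in range(lo_r, hi_r)]
    # Column-based view: one list per neighbouring column.
    column_view = [[cell(i, j) for i in range(lo_r, hi_r)] for j in range(lo_c, hi_c)]

    return table[0][lo_c:hi_c], row_view, column_view


# Toy spreadsheet: the formula for the "Total" cell of the "Book" row is unknown.
table = [
    ["Item", "Price", "Qty", "Total"],
    ["Pen", "2", "10", "20"],
    ["Book", "5", "3", ""],
    ["Bag", "20", "1", "20"],
]
headers, rows, cols = tabular_context(table, target=(2, 3), window=1)
```

In this toy example the headers ("Qty", "Total") and the neighbouring rows both hint that the missing cell is a product of price and quantity, which is information an input-output specification over a single row would not expose. A model would still need to tokenize and embed these views; that step is outside the scope of this sketch.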
