FLAME: A small language model for spreadsheet formulas

The widespread use of spreadsheets by billions of users presents a unique opportunity for formula-authoring assistance. Although large language models, such as Codex, can assist with general-purpose programming languages, they are expensive to train and challenging to deploy because of their size (up to billions of parameters), and they require hundreds of gigabytes of training data. We present FLAME, a T5-based model trained on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and trained on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pretraining objectives. We evaluate FLAME on formula repair, formula auto-completion, and a novel task we call syntax reconstruction. FLAME (60M) outperforms much larger models, such as Codex-Davinci (175B), Codex-Cushman (12B), and CodeT5 (220M), in 6 out of 10 settings.
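
The abstract mentions sketch deduplication without defining it. The minimal Python sketch below shows one plausible reading, under the assumption that a "sketch" is a formula with cell references and constants abstracted to placeholders; all names and regexes here are illustrative, not taken from the paper.

    import re

    def formula_sketch(formula: str) -> str:
        """Abstract an Excel formula to a sketch by replacing string
        literals, cell references, and numeric constants with
        placeholders. (Illustrative approximation; the paper's exact
        abstraction rules may differ.)"""
        sketch = re.sub(r'"[^"]*"', "<STR>", formula)               # string literals
        sketch = re.sub(r'\$?[A-Z]{1,3}\$?\d+', "<CELL>", sketch)   # refs like $A$1, B2
        sketch = re.sub(r'\b\d+(\.\d+)?\b', "<NUM>", sketch)        # numeric constants
        return sketch

    def dedupe_by_sketch(formulas):
        """Keep one representative formula per distinct sketch."""
        seen, kept = set(), []
        for f in formulas:
            s = formula_sketch(f)
            if s not in seen:
                seen.add(s)
                kept.append(f)
        return kept

    # Both formulas share the sketch =SUM(<CELL>:<CELL>)*<NUM>,
    # so only the first is kept.
    print(dedupe_by_sketch(["=SUM(A1:A10)*2", "=SUM(B2:B20)*5"]))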

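The masked span prediction objective can likewise be illustrated with a hypothetical corruption routine in the T5 style: hide a contiguous span of formula tokens behind a sentinel and train the model to emit the hidden span. FLAME's domain-specific variant may select spans differently (e.g., aligned to whole formula tokens rather than subwords); this is a sketch under that assumption, and the helper below is not from the paper.

    import random

    def mask_spans(tokens, span_len=2, seed=0):
        """Replace one random contiguous span with a T5-style sentinel;
        return the (corrupted input, reconstruction target) pair."""
        rng = random.Random(seed)
        start = rng.randrange(len(tokens) - span_len)
        masked = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
        target = ["<extra_id_0>"] + tokens[start:start + span_len] + ["<extra_id_1>"]
        return " ".join(masked), " ".join(target)

    # Formula pre-split into whole formula tokens (one tokenizer assumption).
    tokens = ["=", "IF", "(", "A1", ">", "0", ",", "SUM", "(",
              "B1", ":", "B10", ")", ",", "0", ")"]
    inp, tgt = mask_spans(tokens)
    print(inp)   # corrupted formula with one span replaced by <extra_id_0>
    print(tgt)   # sentinel followed by the original hidden span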