Benchmarking Multimodal Regex Synthesis with Complex Structures

Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce StructuredRegex, a new regex synthesis dataset that differs from prior ones in three respects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed in real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of StructuredRegex over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenges presented by our dataset, including non-local constraints and multimodal inputs.
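
The first and third steps can be illustrated with a minimal Python sketch. It is not the StructuredRegex generator: the macro set, the production weights, and the perturbation strategy for negative strings are all hypothetical and chosen only to show the idea of sampling a regex from a weighted grammar and pairing it with matched and unmatched strings.

```python
import random
import re
import string


def _run_of(chars):
    """Return a sampler that draws a run of 1 to 4 characters from `chars`."""
    def sampler(rng):
        return "".join(rng.choice(chars) for _ in range(rng.randint(1, 4)))
    return sampler


# Hypothetical macros: a regex fragment paired with a sampler for matching text.
MACROS = {
    "digits": (r"[0-9]+", _run_of(string.digits)),
    "lower": (r"[a-z]+", _run_of(string.ascii_lowercase)),
    "upper": (r"[A-Z]+", _run_of(string.ascii_uppercase)),
}
SEPARATORS = ["-", ":", "."]


def sample_regex(rng):
    """Sample a separated-fields regex; the weights are illustrative only."""
    n_fields = rng.choices([1, 2, 3], weights=[0.2, 0.5, 0.3])[0]
    sep = rng.choice(SEPARATORS)
    names = [rng.choice(sorted(MACROS)) for _ in range(n_fields)]
    pattern = re.escape(sep).join(MACROS[name][0] for name in names)

    def sample_positive(inner_rng):
        # Build a string field by field so it matches by construction.
        return sep.join(MACROS[name][1](inner_rng) for name in names)

    return pattern, sample_positive


def sample_negative(pattern, sample_positive, rng, max_tries=100):
    """Perturb one character of a positive string; keep it only if it no longer matches."""
    alphabet = "abcXYZ019-:. "
    for _ in range(max_tries):
        chars = list(sample_positive(rng))
        chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        candidate = "".join(chars)
        if re.fullmatch(pattern, candidate) is None:
            return candidate
    return None  # no non-matching perturbation found within max_tries


def build_example(rng, n_pos=3, n_neg=3):
    """Bundle a sampled regex with positive and negative example strings."""
    pattern, sample_positive = sample_regex(rng)
    positives = [sample_positive(rng) for _ in range(n_pos)]
    negatives = [sample_negative(pattern, sample_positive, rng) for _ in range(n_neg)]
    return {
        "regex": pattern,
        "positive": positives,
        "negative": [s for s in negatives if s is not None],
    }


if __name__ == "__main__":
    print(build_example(random.Random(0)))
```

Negative strings are derived from positive ones here so that they stay close to the decision boundary of the regex; this is a design choice of the sketch, and every negative is verified with `re.fullmatch` before it is kept.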
