Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks

Natural language (NL) is a promising interaction paradigm for data visualization (VIS). However, no NL-to-VIS (NL2VIS) benchmarks are currently available. Our goal is to provide the first NL2VIS benchmark to enable and advance the field of NL2VIS, especially with deep learning technologies. In this paper, we propose an NL2VIS synthesizer (NL2SQL-to-NL2VIS) that synthesizes NL2VIS benchmarks by piggybacking on NL2SQL benchmarks. The intuition rests on the semantic connection between SQL queries and VIS queries: SQL queries specify what data is needed, and VIS queries additionally specify how to visualize it. However, unlike SQL, which has a well-defined syntax, VIS languages (e.g., Vega-Lite, VizQL, ggplot2) differ widely in syntax. To provide NL2VIS benchmarks that can support many VIS languages, we use a unified intermediate representation, abstract syntax trees (ASTs), for both SQL and VIS queries. We synthesize multiple VIS trees by adding nodes to, or deleting nodes from, an SQL tree. Each VIS tree can then be converted to any VIS language. The NL for VIS is obtained by modifying the NL for SQL to reflect the corresponding tree edits. We produce the first NL2VIS benchmark (nvBench) by applying NL2SQL-to-NL2VIS to the popular NL2SQL benchmark Spider; nvBench covers 105 domains, supports seven common types of visualizations, and contains 25,750 (NL, VIS) pairs. Our method reduces the man-hours to 5.7% of those needed to develop an NL2VIS benchmark from scratch (equivalently, building an NL2VIS benchmark from scratch takes 17.5× the man-hours of our method). Extensive human validation, through 23 experts and 312 crowd workers, demonstrates the high quality of nvBench. To verify that nvBench can enable learning-based approaches, we develop a SEQ2VIS model. Our experimental results show that SEQ2VIS works well and significantly outperforms state-of-the-art methods on the NL2VIS task.
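
To make the tree-edit idea concrete, the following is a minimal sketch, not the synthesizer described above. It assumes a toy tree type (`Node`), a hypothetical SQL tree for an illustrative query, and a hypothetical helper `to_vis_tree` that adds a `Visualize` node and deletes nodes by label; none of these names, nor the choice of which node to delete, come from the paper.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Toy AST node (assumed structure, for illustration only)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def sql_tree_example() -> Node:
    # Hypothetical tree for the illustrative query:
    #   SELECT city, COUNT(*) FROM store GROUP BY city ORDER BY city
    return Node("Query", [
        Node("Select", [Node("city"), Node("COUNT(*)")]),
        Node("From", [Node("store")]),
        Node("Group", [Node("city")]),
        Node("Order", [Node("city")]),
    ])


def to_vis_tree(sql_tree: Node, chart_type: str,
                drop: Tuple[str, ...] = ("Order",)) -> Node:
    """Derive a VIS tree from an SQL tree by one insertion and some deletions.

    Insertion: a 'Visualize' subtree naming the chart type.
    Deletion: top-level subtrees whose label is in `drop` (an arbitrary
    choice made for this sketch).
    The input tree is left unmodified.
    """
    kept = [copy.deepcopy(c) for c in sql_tree.children if c.label not in drop]
    visualize = Node("Visualize", [Node(chart_type)])
    return Node("Query", [visualize] + kept)


def labels(tree: Node) -> List[str]:
    """Labels of the top-level subtrees, in order."""
    return [c.label for c in tree.children]


if __name__ == "__main__":
    sql = sql_tree_example()
    # One SQL tree can yield several VIS trees, e.g. one per chart type.
    for chart in ("bar", "pie"):
        print(chart, labels(to_vis_tree(sql, chart)))
```

Because the edits operate on a tree rather than on concrete syntax, a separate, language-specific step could turn each resulting VIS tree into Vega-Lite, ggplot2, or another target; that conversion is outside this sketch.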
