Quda: Natural Language Queries for Visual Data Analytics

Visualization-oriented natural language interfaces (V-NLIs) have been explored and developed in recent years. One challenge V-NLIs face is making effective design decisions, which usually requires a deep understanding of user queries. Learning-based approaches have shown potential in V-NLIs and have reached state-of-the-art performance in various NLP tasks. However, because training samples that cater to visual data analytics are scarce, cutting-edge techniques have rarely been employed to facilitate the development of V-NLIs. We present a new dataset, called Quda, to help V-NLIs understand free-form natural language. Our dataset contains 14,035 diverse user queries annotated with 10 low-level analytic tasks, which supports the deployment of state-of-the-art techniques for parsing complex human language. We build the dataset by first gathering seed queries with data analysts, who are target users of V-NLIs, and then employing a large crowd workforce for paraphrase generation and validation. We demonstrate the usefulness of Quda in building V-NLIs by creating a prototype that makes effective design decisions for free-form user queries. We also show that Quda can benefit a wide range of applications in the visualization community by analyzing the design tasks described in academic publications.
