INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision]

A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user through the exploration process, being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different levels of domain and data science expertise. We introduce INODE, an end-to-end data exploration system that leverages Machine Learning on the one hand and semantics on the other for the purpose of Data Management (DM). Our vision is to develop a classic unified, comprehensive platform that provides extensive access to open datasets, and we demonstrate it in three significant use cases in the fields of Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics. INODE offers sustainable services in (a) data modeling and linking, (b) integrated query processing using natural language, (c) guidance, and (d) data exploration through visualization, thus helping the user discover new insights. We demonstrate that our system is uniquely accessible to a wide range of users, from larger scientific communities to the public. Finally, we briefly illustrate how this work paves the way for new research opportunities in DM.
