To Not Miss the Forest for the Trees - A Holistic Approach for Explaining Missing Answers over Nested Data

Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. Explanations of this type have proven useful, e.g., for debugging complex analytical queries, which are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To compute explanations efficiently, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify.
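
To make the idea of a query-based explanation concrete, the following is a minimal sketch in plain Python, not the heuristic algorithm of the paper. It assumes a toy pipeline of named operators over nested records and re-checks after every operator whether any intermediate row can still contribute to the missing answer. All names (`mentions`, `first_blocking_operator`, the sample `papers` data) are hypothetical, and reasoning about schema alternatives is deliberately left out.

```python
from typing import Callable, List, Optional, Tuple

Row = dict
Operator = Callable[[List[Row]], List[Row]]


def select_year_2019(rows: List[Row]) -> List[Row]:
    # Selection: keep only papers published in 2019.
    return [r for r in rows if r["year"] == 2019]


def flatten_authors(rows: List[Row]) -> List[Row]:
    # Structure-changing operator: one output row per nested author.
    return [{"title": r["title"], "author": a} for r in rows for a in r["authors"]]


def project_title(rows: List[Row]) -> List[Row]:
    # Schema-changing operator: drops every attribute except the title.
    return [{"title": r["title"]} for r in rows]


def mentions(value, needle) -> bool:
    # True if `needle` occurs anywhere inside a possibly nested value.
    if isinstance(value, dict):
        return any(mentions(v, needle) for v in value.values())
    if isinstance(value, (list, tuple)):
        return any(mentions(v, needle) for v in value)
    return value == needle


def first_blocking_operator(
    rows: List[Row], pipeline: List[Tuple[str, Operator]], needle
) -> Optional[str]:
    # Re-validate after each operator whether some intermediate row can
    # still contribute to the missing answer; blame the first operator
    # after which none can. Returns None if the value survives to the end.
    current = rows
    for name, op in pipeline:
        current = op(current)
        if not any(mentions(r, needle) for r in current):
            return name
    return None


papers = [
    {"title": "A", "year": 2018, "authors": ["Ann", "Bob"]},
    {"title": "B", "year": 2019, "authors": ["Cat"]},
]

# Why is "Ann" missing? The selection removes the only paper she authored.
blamed_selection = first_blocking_operator(
    papers,
    [("select year = 2019", select_year_2019), ("flatten authors", flatten_authors)],
    "Ann",
)

# Why is "Cat" missing? Here the projection, not the selection, is responsible.
blamed_projection = first_blocking_operator(
    papers,
    [("select year = 2019", select_year_2019), ("project title", project_title)],
    "Cat",
)
```

In the second call the selection keeps the relevant paper, so the sketch blames the projection. This is the kind of schema-modifying cause that the abstract says earlier techniques do not consider.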
