Effective and Complete Discovery of Order Dependencies via Set-based Axiomatization

Integrity constraints (ICs) are a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human error, and can be excessively time-consuming, especially on large datasets. Hence, automatic discovery has been proposed for some classes of ICs, such as functional dependencies (FDs) and, more recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes than another employee. We address the limitations of prior work on OD discovery, which has factorial complexity in the number of attributes, is incomplete (i.e., it misses valid ODs that cannot be inferred from the ones it finds), and is not concise (i.e., it can produce redundant ODs and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. Our approach is based on a novel polynomial mapping to a canonical form for ODs, together with a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regard to the canonical representation). Finally, using real and synthetic datasets, we experimentally show orders-of-magnitude performance improvements over the current state-of-the-art algorithm and demonstrate the effectiveness of our techniques.
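To make the salary and tax example concrete, the following is a minimal sketch of how a single candidate OD could be validated against a table. It is not the discovery algorithm described above: the function name `holds_order_dependency`, the dictionary-based row layout, and the sample data are all assumptions made for illustration, and the check sorts the rows, so it is not linear in the number of tuples.

```python
from itertools import groupby
from operator import itemgetter


def holds_order_dependency(rows, lhs, rhs):
    """Return True if ordering `rows` by `lhs` also orders them by `rhs`.

    Hypothetical helper: checks that for all rows s, t,
    s[lhs] <= t[lhs] implies s[rhs] <= t[rhs].
    """
    ordered = sorted(rows, key=itemgetter(lhs))
    previous = None
    for _, group in groupby(ordered, key=itemgetter(lhs)):
        values = {row[rhs] for row in group}
        if len(values) > 1:
            # Equal lhs values with different rhs values: the FD lhs -> rhs fails.
            return False
        value = next(iter(values))
        if previous is not None and value < previous:
            # A larger lhs value paired with a smaller rhs value.
            return False
        previous = value
    return True


# Illustrative data only (assumed, not from the paper's datasets).
employees = [
    {"salary": 50, "tax": 10},
    {"salary": 60, "tax": 12},
    {"salary": 70, "tax": 12},
]
violating = employees + [{"salary": 80, "tax": 11}]

print(holds_order_dependency(employees, "salary", "tax"))
print(holds_order_dependency(violating, "salary", "tax"))
```

The first failure case in the sketch reflects why ODs subsume FDs: an OD over single attributes can only hold if the corresponding FD holds as well. The second failure case is the situation from the example, a higher salary paired with lower taxes.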
