Approximate Denial Constraints

The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints, including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisfied) from data. Considering approximate constraints allows us to discover more accurate constraints in inconsistent databases, to detect rules that are generally correct but may have a few exceptions, and to avoid overfitting, yielding more general and less contrived constraints. We introduce the algorithm ADCMiner for mining approximate DCs. An important feature of this algorithm is that it does not assume any specific definition of an approximate DC, but takes the semantics as input. Since there is more than one way to define an approximate DC, and different definitions may produce very different results, we do not focus on one definition but rather on a general family of approximation functions that satisfy some natural axioms defined in this paper and that capture commonly used definitions of approximate constraints. We also show how our algorithm can be combined with sampling to return results with high accuracy while significantly reducing the running time.
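To make the idea of a DC that is "almost" satisfied concrete, the following minimal sketch checks a single candidate DC against a small table. It is not ADCMiner: it only illustrates how the semantics of approximation could be passed in as a function. All names here (`violating_pairs`, `pair_fraction`, `is_approximate_dc`, the sample rows) are hypothetical, and the fraction of violating tuple pairs is used as just one possible approximation function.

```python
from itertools import permutations
import operator

# A DC is modelled as a list of predicates (attr_of_t, op, attr_of_s).
# It forbids any ordered pair of distinct tuples (t, s) for which
# ALL predicates hold at once.

def violating_pairs(rows, predicates):
    """Return index pairs (i, j), i != j, that satisfy every predicate."""
    return [
        (i, j)
        for i, j in permutations(range(len(rows)), 2)
        if all(op(rows[i][a], rows[j][b]) for a, op, b in predicates)
    ]

def pair_fraction(rows, predicates):
    """One possible approximation function: share of ordered pairs that violate."""
    total = len(rows) * (len(rows) - 1)
    if total == 0:
        return 0.0
    return len(violating_pairs(rows, predicates)) / total

def is_approximate_dc(rows, predicates, approx_fn, epsilon):
    """The approximation semantics is an input (approx_fn), not hard-coded."""
    return approx_fn(rows, predicates) <= epsilon

# Illustrative data (made up): the FD zip -> state written as a DC,
# i.e. not (t.zip = s.zip and t.state != s.state).
rows = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NJ"},  # the exception
    {"zip": "94105", "state": "CA"},
]
zip_determines_state = [("zip", operator.eq, "zip"), ("state", operator.ne, "state")]

print(pair_fraction(rows, zip_determines_state))
print(is_approximate_dc(rows, zip_determines_state, pair_fraction, epsilon=0.5))
```

Under this pair-based measure, the single inconsistent row produces 4 violating ordered pairs out of 12, so the DC holds approximately for a threshold of 0.5 but not for 0.1. Swapping in a different `approx_fn` (for example, one based on the number of tuples that would have to be removed) could give a different verdict on the same data, which is why the choice of definition matters.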
