Efficiently enumerating hitting sets of hypergraphs arising in data profiling

The transversal hypergraph problem asks to enumerate the minimal hitting sets of a hypergraph. If the solutions have bounded size, Eiter and Gottlob [SICOMP’95] gave an algorithm running in output-polynomial time, but whose space requirement also scales with the output. We improve this to polynomial delay and space. Central to our approach is the extension problem, deciding for a set X of vertices whether it is contained in any minimal hitting set. We show that this is one of the first natural problems to be W[3]-complete. We give an algorithm for the extension problem running in time $O(m^{|X|+1} n)$ and prove a SETH-lower bound showing that this is close to optimal. We apply our enumeration method to the discovery problem of minimal unique column combinations from data profiling. Our empirical evaluation suggests that the algorithm outperforms its worst-case guarantees on hypergraphs stemming from real-world databases.

An extended abstract of this work was presented at the 21st Meeting on Algorithm Engineering and Experiments (ALENEX 2019) [8].

∗ Corresponding author (martin.schirneck@hpi.de).

Email addresses: thomas.blaesius@kit.edu (Thomas Bläsius), tobias.friedrich@hpi.de (Tobias Friedrich), jsl71@cantab.ac.uk (Julius Lischeid), kitty.meeks@glasgow.ac.uk (Kitty Meeks), martin.schirneck@hpi.de (Martin Schirneck)

1 This work originated while all but the fourth author were affiliated with the Hasso Plattner Institute at the University of Potsdam.

2 Kitty Meeks is supported by a Personal Research Fellowship from the Royal Society of Edinburgh, funded by the Scottish Government.

Preprint submitted to Journal of Computer and System Sciences.
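To make the extension problem from the abstract concrete, the following is a minimal brute-force sketch, not the authors' implementation. It assumes that m denotes the number of edges and n the number of vertices, which the abstract does not spell out, and it relies on a standard characterization: X lies in some minimal hitting set exactly when every x in X can be assigned a witness edge meeting X only in x, such that no edge of the hypergraph is contained in the union of the witnesses minus X. The function name `is_extendable` and the set-based input format are hypothetical choices made for illustration.

```python
from itertools import product


def is_extendable(edges, X):
    """Decide whether X is contained in some minimal hitting set of `edges`.

    Illustrative sketch only: it tries every combination of witness edges,
    which gives roughly m^|X| combinations, each checked against all m edges.
    """
    edges = [frozenset(e) for e in edges]
    X = frozenset(X)

    # Candidate witnesses for x: edges whose intersection with X is exactly {x}.
    candidates = [[e for e in edges if e & X == {x}] for x in X]

    # If some x has no candidate, product() yields nothing and we return False.
    for witnesses in product(*candidates):
        # Vertices that must stay outside the solution so that each witness
        # edge is hit only by its own vertex of X.
        blocked = frozenset().union(*witnesses) - X
        # The remaining vertices form a hitting set iff no edge is blocked entirely.
        if not any(e <= blocked for e in edges):
            return True
    return False


if __name__ == "__main__":
    path = [{1, 2}, {2, 3}]          # minimal hitting sets: {2} and {1, 3}
    print(is_extendable(path, {1, 3}))
    print(is_extendable(path, {1, 2}))
```

If such witnesses exist, greedily shrinking the set of non-blocked vertices to a minimal hitting set can never remove a vertex of X, because its witness edge would become unhit. The nested loops over witness combinations and edges are consistent with a bound of the form $O(m^{|X|+1} n)$, although this sketch makes no attempt to match the paper's constants or data structures.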

[1]  Toshihide Ibaraki, et al.  Complexity of Identification and Dualization of Positive Boolean Functions, 1995, Inf. Comput.

[2]  Jianer Chen, et al.  On Product Covering in Supply Chain Models: Natural Complete Problems for W[3] and W[4], 2005, AAIM.

[3]  Christos H. Papadimitriou, et al.  Incremental Recompilation of Knowledge, 1994, AAAI.

[4]  Georg Gottlob, et al.  Computational aspects of monotone dualization: A brief survey, 2008, Discret. Appl. Math.

[5]  Juho Lauri, et al.  Engineering Motif Search for Large Graphs, 2015, ALENEX.

[6]  Nadia Creignou, et al.  On Generating All Solutions of Generalized Satisfiability Problems, 1997, RAIRO Theor. Informatics Appl.

[7]  Vladimir Gurvich, et al.  A global parallel algorithm for the hypergraph transversal problem, 2007, Inf. Process. Lett.

[8]  Leonid Khachiyan, et al.  On the Complexity of Dualization of Monotone Disjunctive Normal Forms, 1996, J. Algorithms.

[9]  Russell Impagliazzo, et al.  Completeness for First-order Properties on Sparse Structures with Algorithmic Applications, 2017, SODA.

[10]  Matthias Hagen, et al.  Some Fixed-Parameter Tractable Classes of Hypergraph Duality and Related Problems, 2008, IWPEC.

[11]  Mihalis Yannakakis, et al.  On Generating All Maximal Independent Sets, 1988, Inf. Process. Lett.

[12]  Rolf Niedermeier, et al.  Invitation to Fixed-Parameter Algorithms, 2006.

[13]  Robert E. Tarjan, et al.  Bounds on Backtrack Algorithms for Listing Cycles, Paths, and Spanning Trees, 1975, Networks.

[14]  Felix Naumann, et al.  Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms, 2015, Proc. VLDB Endow.

[15]  Russell Impagliazzo, et al.  Which Problems Have Strongly Exponential Complexity?, 2001, J. Comput. Syst. Sci.

[16]  Felix Naumann, et al.  Data dependencies for query optimization: a survey, 2021, The VLDB Journal.

[17]  Michael Lampis, et al.  Upper Dominating Set: Tight Algorithms for Pathwidth and Sub-Exponential Approximation, 2021, CIAC.

[18]  Peter Damaschke, et al.  Parameterized algorithms for double hypergraph dualization with rank limitation and maximum minimal vertex cover, 2011, Discret. Optim.

[19]  Michael R. Fellows, et al.  On the parameterized complexity of multiple-interval graph problems, 2009, Theor. Comput. Sci.

[20]  Ge Xia, et al.  Strong computational lower bounds via parameterized complexity, 2006, J. Comput. Syst. Sci.

[21]  Vangelis Th. Paschos, et al.  The many facets of upper domination, 2017, Theor. Comput. Sci.

[22]  Ryan Williams, et al.  A new algorithm for optimal 2-constraint satisfaction and its implications, 2005, Theor. Comput. Sci.

[23]  Vangelis Th. Paschos, et al.  On the max min vertex cover problem, 2015, Discret. Appl. Math.

[24]  Raymond Reiter, et al.  A Theory of Diagnosis from First Principles, 1986, Artif. Intell.

[25]  Felix Naumann, et al.  Scalable Discovery of Unique Column Combinations, 2013, Proc. VLDB Endow.

[26]  Hector Garcia-Molina, et al.  How to assign votes in a distributed system, 1985, JACM.

[27]  Tobias Friedrich, et al.  Efficiently Enumerating Hitting Sets of Hypergraphs Arising in Data Profiling, 2018, ALENEX.

[28]  Lhouari Nourine, et al.  About Keys of Formal Context and Conformal Hypergraph, 2008, ICFCA.

[29]  Yann Strozecki, et al.  Incremental delay enumeration: Space and time, 2019, Discret. Appl. Math.

[30]  E. Lawler.  A Procedure for Computing the K Best Solutions to Discrete Optimization Problems and Its Application to the Shortest Path Problem, 1972.

[31]  Henning Fernau, et al.  Extension of vertex cover and independent set in some classes of graphs and generalizations, 2019, CIAC.

[32]  Michael R. Fellows, et al.  Fundamentals of Parameterized Complexity, 2013.

[33]  P. Hammer, et al.  Dual subimplicants of positive Boolean functions, 1998.

[34]  Thomas Eiter, et al.  Exact Transversal Hypergraphs and Application to Boolean µ-Functions, 1994, J. Symb. Comput.

[35]  Michal Pilipczuk, et al.  Parameterized Algorithms, 2015, Springer International Publishing.

[36]  Russell Impagliazzo, et al.  On the Complexity of k-SAT, 2001, J. Comput. Syst. Sci.

[37]  Russell Impagliazzo, et al.  Nondeterministic Extensions of the Strong Exponential Time Hypothesis and Consequences for Non-reducibility, 2016, Electron. Colloquium Comput. Complex.