DAGs with NO TEARS: Continuous Optimization for Structure Learning

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.
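The abstract does not state the smooth acyclicity characterization itself. As a minimal sketch, assuming the trace-of-matrix-exponential form given in the full paper, h(W) = tr(exp(W∘W)) - d, where W is a d×d weighted adjacency matrix and ∘ is the elementwise product, the following pure-Python snippet evaluates a truncated version of that function. The helper names `matmul` and `acyclicity` are illustrative and not taken from the authors' code, and cutting the power series at degree d is a simplification made here, not the paper's procedure.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w):
    """Degree-d truncation of h(W) = tr(exp(W∘W)) - d.

    Computes sum_{k=1..d} tr(A^k) / k! with A = W∘W. A graph on d nodes
    has a cycle iff it has one of length <= d, so this truncated sum is
    zero exactly when the graph encoded by W is acyclic (assumption: the
    truncation is an illustrative shortcut, not the paper's method).
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # elementwise square, so entries are >= 0
    power = [row[:] for row in a]            # holds A^k, starting at k = 1
    h = 0.0
    for k in range(1, d + 1):
        h += sum(power[i][i] for i in range(d)) / math.factorial(k)
        if k < d:
            power = matmul(power, a)
    return h


# Edges 0 -> 1 -> 2: acyclic
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]

# Edges 0 -> 1 -> 2 -> 0: a directed 3-cycle
cyclic = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 0.0, 0.0]]

print(acyclicity(dag))     # zero for the DAG
print(acyclicity(cyclic))  # strictly positive for the cycle
```

Because the function is a polynomial in the entries of W, it is differentiable everywhere, which is what allows the constraint h(W) = 0 to be passed to a standard constrained numerical solver in place of a combinatorial search over graphs.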
