Load-Balanced Parallel Constraint-Based Causal Structure Learning on Multi-Core Systems for High-Dimensional Data

For high-dimensional data, state-of-the-art methods for constraint-based causal structure learning, such as the PC algorithm, are limited in their applicability by their worst-case exponential computational complexity. To address the resulting long execution times, several parallel extensions have been developed to exploit modern multi-core systems. These extensions distribute tasks statically across the execution units to achieve parallelism, which introduces the problem of load imbalance. In our work, we propose a parallel implementation that distributes tasks dynamically in order to avoid load imbalance and reduce execution time. In an experimental evaluation on real-world high-dimensional datasets, we show that our implementation achieves better load balancing than an existing parallel implementation for multivariate normally distributed data. For datasets that introduce load imbalance, our dynamic task distribution approach outperforms existing static approaches by factors of up to 2.4. Overall, when scaling to 80 cores, we increase the speedup over a non-parallel execution from factors of up to 27 for the static approach to factors of up to 39 for the dynamic approach.
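
The difference between static and dynamic task distribution can be illustrated with a small simulation. The sketch below is not the implementation evaluated in this work; the function names, the task costs, and the worker count are hypothetical and chosen only to show how unevenly sized tasks lead to load imbalance under a static split. It computes the makespan, that is, the finish time of the busiest worker, for both strategies.

```python
import heapq


def static_makespan(costs, workers):
    """Static distribution: split the task list into contiguous,
    equally sized blocks up front, one block per worker."""
    if not costs:
        return 0
    chunk = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs, workers):
    """Dynamic distribution: whichever worker becomes idle first
    takes the next task from a shared queue."""
    finish_times = [0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical workload: a few expensive tasks followed by many cheap ones,
# loosely mimicking tasks whose cost varies strongly between variables.
costs = [8, 8, 8, 8] + [1] * 12
workers = 4

static_result = static_makespan(costs, workers)
dynamic_result = dynamic_makespan(costs, workers)
print("static makespan: ", static_result)
print("dynamic makespan:", dynamic_result)
```

In this toy workload the static split hands all expensive tasks to the first worker, while the dynamic queue spreads them across all four. The numbers say nothing about the speedups reported above, which depend on the real datasets and hardware.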
