Multi-Source Causal Analysis: Learning Bayesian Networks from Multiple Datasets

We argue that causality is a useful, if not necessary, concept for the integrative analysis of multiple data sources. Specifically, we show that it enables learning causal relations from (a) data obtained under different experimental conditions, (b) data over different variable sets, and (c) data over semantically similar variables that nevertheless cannot be pooled for various technical reasons. The last case in particular often arises when analyzing multiple gene-expression datasets. Preliminary algorithms already exist for cases (a) and (b), albeit with some limitations, while for case (c) we develop and evaluate a new method. Preliminary empirical results provide evidence that combining multiple sources with our method improves the learning of causal relations compared with learning from each dataset individually. In this context we introduce the problem of Multi-Source Causal Analysis (MSCA), defined as the problem of inferring and inducing causal knowledge from multiple sources of data and knowledge. The grand vision of MSCA is to enable the automated or semi-automated, large-scale integration of available data to construct causal models involving a significant part of human concepts.