Towards a Learning Theory of Causation

We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection {(S_i, l_i)}_{i=1}^n, where each S_i is a sample drawn from the probability distribution of X_i × Y_i, and l_i is a binary label indicating whether “X_i → Y_i” or “X_i ← Y_i”. Given these data, we build a causal inference rule in two steps. First, we featurize each S_i using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.
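The two-step rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian kernel approximated with random Fourier features, a toy additive-noise generator for the labeled pairs, and a ridge-regularized least-squares classifier standing in for any binary classifier. All function names (`make_rff`, `mean_embedding`, `synthetic_pair`, `embed_dataset`, `fit_least_squares`, `predict`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def make_rff(dim_in, n_features, gamma, rng):
    """Random Fourier features for the Gaussian kernel exp(-gamma * ||a - b||^2)."""
    # The spectral density of this kernel is Gaussian with variance 2 * gamma.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim_in, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b


def mean_embedding(sample, W, b):
    """Step 1: empirical kernel mean embedding (feature map averaged over the sample)."""
    z = np.sqrt(2.0 / W.shape[1]) * np.cos(sample @ W + b)
    return z.mean(axis=0)


def synthetic_pair(n_points, rng):
    """Toy labeled sample S_i (assumed generator): label 1 means column 0 causes column 1."""
    x = rng.uniform(-1.0, 1.0, size=n_points)
    a = rng.uniform(0.5, 2.0)
    y = a * x**3 + x + 0.2 * rng.normal(size=n_points)
    s = np.column_stack([x, y])
    s = (s - s.mean(axis=0)) / s.std(axis=0)  # standardize both variables
    label = int(rng.integers(0, 2))
    if label == 0:
        s = s[:, ::-1]  # swap columns so that column 1 causes column 0
    return s, label


def embed_dataset(n_pairs, n_points, W, b, rng):
    """Featurize a collection of labeled samples into one embedding per sample."""
    embs, labels = [], []
    for _ in range(n_pairs):
        s, l = synthetic_pair(n_points, rng)
        embs.append(mean_embedding(s, W, b))
        labels.append(l)
    return np.array(embs), np.array(labels)


def fit_least_squares(M, labels, ridge=1e-3):
    """Step 2: ridge-regularized least-squares classifier on targets in {-1, +1}."""
    A = np.column_stack([M, np.ones(len(M))])  # append a bias column
    t = 2.0 * labels - 1.0
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ t)


def predict(M, w):
    """Predict 1 for 'column 0 causes column 1', else 0."""
    A = np.column_stack([M, np.ones(len(M))])
    return (A @ w > 0).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = make_rff(dim_in=2, n_features=100, gamma=1.0, rng=rng)
    M_train, y_train = embed_dataset(200, 200, W, b, rng)
    M_test, y_test = embed_dataset(100, 200, W, b, rng)
    w = fit_least_squares(M_train, y_train)
    print("held-out accuracy:", (predict(M_test, w) == y_test).mean())
```

The same random features `W, b` must be reused for every sample, so that all embeddings live in one common feature space where the classifier can compare them.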
