Towards a Learning Theory of Causation

We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection $\{(S_i,l_i)\}_{i=1}^n$, where each $S_i$ is a sample drawn from the probability distribution of $X_i \times Y_i$, and $l_i$ is a binary label indicating whether "$X_i \to Y_i$" or "$X_i \leftarrow Y_i$". Given these data, we build a causal inference rule in two steps. First, we featurize each $S_i$ using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.
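The two-step rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel approximated with random Fourier features, a plain logistic-regression classifier trained by stochastic gradient descent, and a hypothetical toy generator (`synthetic_pair`) in which the effect is a noisy cubic function of the cause. All function names and parameter values are illustrative.

```python
import math
import random


def make_random_features(dim, n_features, gamma, rng):
    """Random Fourier parameters for the kernel exp(-gamma * ||a - b||^2)."""
    scale = math.sqrt(2.0 * gamma)  # std of the Gaussian spectral density
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return weights, offsets


def mean_embedding(sample, weights, offsets):
    """Step 1: approximate kernel mean embedding of one sample S_i,
    i.e. the average random-feature vector over its points."""
    m = len(weights)
    norm = math.sqrt(2.0 / m)
    emb = [0.0] * m
    for point in sample:
        for j in range(m):
            proj = sum(w * x for w, x in zip(weights[j], point)) + offsets[j]
            emb[j] += norm * math.cos(proj)
    return [e / len(sample) for e in emb]


def standardize(values):
    """Rescale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def synthetic_pair(n_points, rng):
    """Hypothetical toy data: cause ~ N(0, 1), effect = cause^3 + noise.
    Label 1 means the sample is stored as 'X -> Y', label 0 as 'X <- Y'."""
    cause = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    effect = [c ** 3 + rng.gauss(0.0, 0.5) for c in cause]
    cause, effect = standardize(cause), standardize(effect)
    if rng.random() < 0.5:
        return list(zip(cause, effect)), 1
    return list(zip(effect, cause)), 0


def train_logistic(features, labels, epochs=200, lr=0.5):
    """Step 2: binary classifier on the embeddings (logistic regression, SGD)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return 1 for 'X -> Y' and 0 for 'X <- Y'."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


if __name__ == "__main__":
    rng = random.Random(0)
    weights, offsets = make_random_features(dim=2, n_features=30, gamma=1.0, rng=rng)
    train = [synthetic_pair(50, rng) for _ in range(100)]
    test = [synthetic_pair(50, rng) for _ in range(50)]
    w, b = train_logistic(
        [mean_embedding(s, weights, offsets) for s, _ in train],
        [label for _, label in train],
    )
    hits = sum(predict(w, b, mean_embedding(s, weights, offsets)) == label
               for s, label in test)
    print("held-out accuracy on toy data:", hits / len(test))
```

The same random feature parameters are reused for every sample, so all embeddings live in one shared feature space; that is what lets a single classifier compare distributions. The accuracy printed on this toy data says nothing about performance on real cause-effect pairs.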

Bernhard Schölkopf | David Lopez-Paz | Krikamol Muandet | Ilya O. Tolstikhin