Marginalizable Density Models

Probability density models based on deep networks have achieved remarkable success in modeling complex high-dimensional datasets. However, unlike kernel density estimators, modern neural models do not yield marginals or conditionals in closed form, since these quantities require evaluating integrals that are rarely tractable. In this work, we present the marginalizable density model approximator (MDMA), a novel deep network architecture that provides closed-form expressions for the probabilities, marginals and conditionals of any subset of the variables. The MDMA learns deep scalar representations of each individual variable and combines them via learned hierarchical tensor decompositions into a tractable yet expressive CDF, from which marginal and conditional densities are easily obtained. We illustrate the advantage of exact marginalizability on several tasks that are out of reach of previous deep-network-based density estimation models, such as estimating mutual information between arbitrary subsets of variables, inferring causality by testing for conditional independence, and inference with missing data without the need for imputation, and we outperform state-of-the-art models on these tasks. The model also allows for parallelized sampling, with a time complexity that depends only logarithmically on the number of variables.
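
To illustrate how combining univariate CDFs through a tensor of mixing weights yields exact marginals and conditionals, here is a minimal sketch, not the paper's implementation: simple Gaussian CDFs stand in for the MDMA's deep monotone scalar networks, and a plain rank-k CP mixture stands in for its hierarchical tensor decomposition. All names (joint_cdf, joint_pdf, phi, w, mu, sigma) are illustrative assumptions. Marginalizing a variable corresponds to sending its CDF factor to 1, i.e. simply dropping it from the product, and conditional densities are ratios of such marginals.

    # Sketch of a marginalizable density: per-variable univariate CDFs combined
    # by a nonnegative, CP-decomposed mixing tensor (illustrative, not the MDMA code).
    import numpy as np
    from scipy.stats import norm

    d, m, k = 3, 4, 5            # variables, univariate components per variable, CP rank
    rng = np.random.default_rng(0)

    mu = rng.normal(size=(d, m))                  # component means per variable
    sigma = np.exp(rng.normal(size=(d, m)))       # component scales per variable (> 0)
    w = rng.dirichlet(np.ones(k))                 # CP mixture weights, nonnegative, sum to 1
    phi = rng.dirichlet(np.ones(m), size=(d, k))  # phi[i, r]: probability vector over the m components

    def joint_cdf(x, keep=None):
        """CDF of the variables listed in `keep` (all variables by default).
        Marginalizing a variable sets its CDF factor to 1, i.e. drops it from the product."""
        keep = range(d) if keep is None else keep
        acc = np.ones(k)
        for i in keep:
            Fi = norm.cdf(x[i], loc=mu[i], scale=sigma[i])  # m univariate CDFs for variable i
            acc *= phi[i] @ Fi                              # mix them within each CP component
        return float(w @ acc)

    def joint_pdf(x, keep=None):
        """Density of the same subset, obtained by differentiating each retained factor."""
        keep = range(d) if keep is None else keep
        acc = np.ones(k)
        for i in keep:
            fi = norm.pdf(x[i], loc=mu[i], scale=sigma[i])  # corresponding univariate densities
            acc *= phi[i] @ fi
        return float(w @ acc)

    x = np.array([0.3, -1.2, 0.7])
    p_all = joint_pdf(x)                 # full joint density
    p_01 = joint_pdf(x, keep=[0, 1])     # exact marginal density of (x_0, x_1)
    p_2_given_01 = p_all / p_01          # exact conditional density of x_2 given x_0, x_1

In this toy construction each CP component is a product of valid univariate CDFs, so the mixture is itself a valid joint CDF, and every marginal or conditional of any subset of variables is available in closed form, which is the property the abstract highlights.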
