Abstract

Many applications infer the structure of a probabilistic graphical model from data to elucidate the relationships between variables. But how can we train graphical models on a massive data set? In this paper, we show how to construct coresets, i.e., compressed data sets that can serve as a proxy for the original data and have provably bounded worst-case error, for Gaussian dependency networks (DNs), i.e., cyclic directed graphical models over Gaussians in which the parents of each variable form its Markov blanket. Specifically, we prove that Gaussian DNs admit coresets of size independent of the size of the data set. Unfortunately, this does not extend to DNs over members of the exponential family in general: as we prove, Poisson DNs do not admit small coresets. Despite this worst-case result, we provide an argument for why our coreset construction for DNs can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluate the resulting Core DNs on real data sets. The results demonstrate significant gains over no sub-sampling and over naive sub-sampling, even in the case of count data.

Introduction

Artificial intelligence and machine learning have achieved considerable successes in recent years, and an ever-growing number of disciplines rely on them. Data is now ubiquitous, and there is great value in understanding the data, e.g., in building probabilistic graphical models to elucidate the relationships between variables. In the big data era, however, scalability has become crucial for any useful machine learning approach.

In this paper, we consider the problem of training graphical models, in particular Dependency Networks (Heckerman et al. 2000), on massive data sets. Dependency Networks are cyclic directed graphical models in which the parents of each variable form its Markov blanket. They have proven successful in various tasks, such as collaborative filtering (Heckerman et al. 2000), phylogenetic analysis (Carlson et al. 2008), genetic analysis (Dobra 2009; Phatak et al. 2010), network inference from sequencing data (Allen and Liu 2013), and traffic as well as topic modeling (Hadiji et al. 2015).

Specifically, we show that Dependency Networks over Gaussians, arguably one of the most prominent types of distribution in statistical machine learning, admit coresets of size independent of the size of the data set. Coresets are weighted subsets of the data which guarantee that models fitting them will also provide a good fit for the original data set. They have been studied before for clustering (Badoiu, Har-Peled, and Indyk 2002; Feldman, Faulkner, and Krause 2011; Feldman, Schmidt, and Sohler 2013; Lucic, Bachem, and Krause 2016), classification (Har-Peled, Roth, and Zimak 2007; Har-Peled 2015; Reddi, Póczos, and Smola 2015), regression (Drineas, Mahoney, and Muthukrishnan 2006; 2008; Dasgupta et al. 2009; Geppert et al. 2017), and the smallest enclosing ball problem (Badoiu and Clarkson 2003; 2008; Feldman, Munteanu, and Sohler 2014; Agarwal and Sharathkumar 2015); we refer to Phillips (2017) for a recent, extensive literature overview. Our contribution continues this line of research and generalizes the use of coresets to probabilistic graphical modeling.
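For concreteness, the guarantee behind coresets is usually formalized as follows; this is the standard (1 ± ε)-approximation notion from the coreset literature cited above, stated with generic placeholders $\mathrm{cost}$ and $\mathcal{Q}$ rather than this paper's specific notation. A weighted subset $C$ of the data $X$ is an $\varepsilon$-coreset with respect to a class of queries (candidate models) $\mathcal{Q}$ if

\[
\forall q \in \mathcal{Q}: \quad \bigl| \mathrm{cost}_C(q) - \mathrm{cost}_X(q) \bigr| \;\le\; \varepsilon \cdot \mathrm{cost}_X(q),
\]

where $\mathrm{cost}_C$ evaluates the weighted cost on the coreset. Any model that is near-optimal on $C$ is then provably near-optimal on $X$ as well.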
Unfortunately, this coreset result does not extend to Dependency Networks over members of the exponential family in general. We prove that Dependency Networks over Poisson random variables (Allen and Liu 2013; Hadiji et al. 2015) do not admit (sublinear-size) coresets: every single input point is important for the model and needs to appear in the coreset. This is unfortunate when modeling count data, the primary target of Poisson distributions, which is at the center of many scientific endeavors such as citation counts, numbers of web page hits, counts of procedures in medicine, etc. Therefore, despite our worst-case result, we provide an argument for why our coreset construction for Dependency Networks can still work well in practice on count data. To corroborate our theoretical results, we empirically evaluate the resulting Core Dependency Networks (CDNs) on several real data sets and demonstrate significant gains over no sub-sampling and over naive sub-sampling, even for count data.

We proceed as follows. We review Dependency Networks (DNs), prove that Gaussian DNs admit sublinear-size coresets, and discuss the possibility of generalizing this result to count data. Before concluding, we present empirical results.

Dependency Networks

Most of the existing AI and machine learning literature on graphical models is dedicated to binary, multinomial, or certain classes of continuous (e.g., Gaussian) random variables. Undirected models, aka Markov Random Fields (MRFs), such as Ising (binary random variables) and Potts (multinomial random variables) models, have found many applications.
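This paper, by contrast, focuses on dependency networks. To make the earlier description concrete, recall the formulation of Heckerman et al. (2000): a dependency network over variables $X_1, \dots, X_d$ is given by one local conditional distribution per variable, conditioned on all remaining variables,

\[
P_i\left(x_i \mid \mathbf{x}_{-i}\right), \qquad i = 1, \dots, d,
\]

where $\mathbf{x}_{-i}$ denotes all variables except $X_i$; the parents of $X_i$ are the variables its conditional actually depends on, i.e., its Markov blanket. In the Gaussian case, each local conditional is a linear Gaussian (a standard fact about multivariate Gaussians, restated here as a hedged sketch rather than this paper's exact notation),

\[
X_i \mid \mathbf{x}_{-i} \;\sim\; \mathcal{N}\!\left( \mathbf{w}_i^{\top} \mathbf{x}_{-i} + b_i,\; \sigma_i^2 \right),
\]

so learning a Gaussian DN reduces to $d$ independent least-squares ($\ell_2$) regression problems, one per variable, which is precisely the setting in which the $\ell_2$-regression coresets cited above apply.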
References

[1] Piotr Indyk, et al. Approximate clustering via core-sets. STOC '02, 2002.
[2] S. Muthukrishnan, et al. Relative-Error CUR Matrix Decompositions. SIAM J. Matrix Anal. Appl., 2007.
[3] Pradeep Ravikumar, et al. Graphical models via univariate exponential family distributions. J. Mach. Learn. Res., 2013.
[4] Pedro M. Domingos, et al. Sum-product networks: A new deep architecture. IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[5] Christian Wietfeld, et al. LTE Connectivity and Vehicular Traffic Prediction Based on Machine Learning Approaches. IEEE 82nd Vehicular Technology Conference (VTC2015-Fall), 2015.
[6] William J. Wilson, et al. NetRaVE: constructing dependency networks using sparse linear regression. Bioinform., 2010.
[7] Michael W. Mahoney. Randomized Algorithms for Matrices and Data. Found. Trends Mach. Learn., 2011.
[8] Dan Feldman, et al. Smallest enclosing ball for probabilistic data. SoCG, 2014.
[9] Andreas Krause, et al. Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures. AISTATS, 2015.
[10] Andreas Krause, et al. Scalable Training of Mixture Models via Coresets. NIPS, 2011.
[11] Mark Rudelson, et al. Sampling from large matrices: An approach through geometric functional analysis. JACM, 2005.
[12] Genevera I. Allen, et al. A Local Poisson Graphical Model for Inferring Networks From Sequencing Data. IEEE Transactions on NanoBioscience, 2013.
[13] Joel A. Tropp, et al. Improved Analysis of the subsampled Randomized Hadamard Transform. Adv. Data Sci. Adapt. Anal., 2010.
[14] David B. Dunson, et al. Lognormal and Gamma Mixed Negative Binomial Regression. ICML, 2012.
[15] Pankaj K. Agarwal, et al. Streaming Algorithms for Extent Problems in High Dimensions. SODA '10, 2010.
[16] Christian Sohler, et al. Random projections for Bayesian regression. Statistics and Computing, 2015.
[17] Alexander J. Smola, et al. Communication Efficient Coresets for Empirical Loss Minimization. UAI, 2015.
[18] Kristian Kersting, et al. Poisson Sum-Product Networks: A Deep Architecture for Tractable Multivariate Poisson Distributions. AAAI, 2017.
[19] David Heckerman, et al. Phylogenetic Dependency Networks: Inferring Patterns of CTL Escape and Codon Covariation in HIV-1 Gag. PLoS Comput. Biol., 2008.
[20] A. Dobra. Variable selection and dependency networks for genomewide data. Biostatistics, 2009.
[21] Kenneth L. Clarkson, et al. Smaller core-sets for balls. SODA '03, 2003.
[22] Sariel Har-Peled. A Simple Algorithm for Maximum Margin Classification, Revisited. ArXiv, 2015.
[23] Jeff M. Phillips, et al. Coresets and Sketches. ArXiv, 2016.
[24] Ping Ma, et al. A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res., 2013.
[25] David P. Woodruff, et al. Low rank approximation and regression in input sparsity time. STOC '13, 2013.
[26] David Heckerman, et al. Dependency Networks for Density Estimation, Collaborative Filtering, and Data Visualization. 2000.
[27] Kenneth L. Clarkson, et al. Optimal core-sets for balls. Comput. Geom., 2008.
[28] Fabian Hadiji, et al. Poisson Dependency Networks: Gradient Boosted Models for Multivariate Count Data. Machine Learning, 2015.
[29] Ravi Kumar, et al. The One-Way Communication Complexity of Hamming Distance. Theory Comput., 2008.
[30] J. Besag. Statistical Analysis of Non-Lattice Data. 1975.
[31] Dan Roth, et al. Maximum Margin Coresets for Active and Noise Tolerant Learning. IJCAI, 2007.
[32] H. Friedl. Econometric Analysis of Count Data. 2002.
[33] Rajeev Motwani, et al. Randomized Algorithms. Cambridge University Press, 1995.
[34] S. Muthukrishnan, et al. Sampling algorithms for l2 regression and applications. SODA '06, 2006.
[35] Anirban Dasgupta, et al. Sampling algorithms and coresets for ℓp regression. SODA '08, 2008.
[36] Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. SODA, 2013.
[37] Yoshua Bengio, et al. Deep Generative Stochastic Networks Trainable by Backprop. ICML, 2013.
[38] P. McCullagh, et al. Generalized Linear Models. 1984.
[39] L. Schulman, et al. Universal ε-approximators for integrals. SODA '10, 2010.