Distributed Estimation, Information Loss and Exponential Families

Distributed learning of probabilistic models from multiple data repositories with minimal communication is increasingly important. We study a simple communication-efficient learning framework that first computes the local maximum likelihood estimates (MLEs) on the data subsets, and then combines the local MLEs to best approximate the global MLE computed on the whole dataset. We study the statistical properties of this framework, showing that the efficiency loss relative to the global setting depends on how much the underlying distribution family deviates from a full exponential family, which draws a connection to the theory of information loss developed by Fisher, Rao and Efron. We show that this "full-exponential-family-ness" determines the lower bound on the error rate of arbitrary combinations of local MLEs, and that the bound is achieved by a KL-divergence-based combination method but not by the more common linear combination method. We also study the empirical properties of both methods, showing that the KL method significantly outperforms linear combination in practical settings that involve model misspecification, non-convexity, and heterogeneous data partitions.
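
The contrast between the two combination methods can be illustrated with a minimal sketch. It assumes a univariate Gaussian model (a full exponential family) and equal-sized data subsets; the function names `local_mle`, `linear_average` and `kl_average` are hypothetical and are not taken from the paper. For a Gaussian, minimizing the sum of KL divergences from the local models to a single combined model reduces to matching the averaged first and second moments, which is what `kl_average` does under these assumptions.

```python
import numpy as np

def local_mle(x):
    # Gaussian MLE: sample mean and (biased, ddof=0) sample variance.
    return x.mean(), x.var()

def linear_average(estimates):
    # Linear combination: average the parameters (mean, variance) directly.
    mus, variances = np.array(estimates).T
    return mus.mean(), variances.mean()

def kl_average(estimates):
    # KL-based combination for Gaussians (assumption: equal subset sizes):
    # argmin over (mu, var) of sum_k KL(N(mu_k, var_k) || N(mu, var))
    # is obtained by matching the averaged first and second moments.
    mus, variances = np.array(estimates).T
    mu = mus.mean()
    second_moment = (variances + mus**2).mean()
    return mu, second_moment - mu**2

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=1000)
subsets = np.split(data, 10)              # 10 equal-sized "repositories"

local = [local_mle(s) for s in subsets]   # only these are communicated
global_mle = local_mle(data)
kl = kl_average(local)
lin = linear_average(local)

print("global MLE     :", global_mle)
print("KL average     :", kl)
print("linear average :", lin)
```

In this setting the KL-based combination coincides with the global MLE up to floating-point error, while the linear average of the variances ignores the spread of the local means and therefore underestimates the global variance. This toy case covers only the full exponential family situation; it does not reproduce the curved-family or misspecified settings discussed in the abstract.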
