Scalable Score Computation for Learning Multinomial Bayesian Networks over Distributed Data

In this paper, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in a scalable manner, a fundamental task during learning. We propose a novel approach designed to achieve: (a) scalable score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores based on the properties of a Markov chain; and (c) effective distribution of tasks during score computation on large datasets by synergistically combining well-known hashing techniques. Through theoretical analysis, we show that our approach requires less communication bandwidth than a MapReduce-style computation. It is also superior to the batch-style processing of MapReduce for recomputing scores when new data become available.
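To make the gossip-based idea concrete, below is a minimal, hypothetical sketch of push-sum gossip averaging (in the spirit of gossip-based aggregate computation), showing how partial counts held by cluster nodes could be combined into the global sufficient statistics that a decomposable multinomial score needs, without a central coordinator. The simulated topology, number of rounds, uniform random peer selection, and the function name `push_sum` are assumptions made for illustration; this is not the paper's actual protocol.

```python
# Hypothetical sketch: push-sum gossip averaging used to aggregate per-node
# counts (e.g., a partial N_ijk sufficient statistic) into a global count.
# Every node repeatedly keeps half of its (value, weight) pair and pushes the
# other half to a random peer; value/weight converges to the global average.
import random


def push_sum(local_counts, rounds=100, seed=0):
    """Return each node's estimate of the global sum of local_counts."""
    rng = random.Random(seed)
    n = len(local_counts)
    values = list(map(float, local_counts))  # local partial counts
    weights = [1.0] * n                      # push-sum weights, conserved at n in total
    for _ in range(rounds):
        new_values = [0.0] * n
        new_weights = [0.0] * n
        for i in range(n):
            target = rng.randrange(n)  # uniformly random peer (assumption)
            # Keep half locally, send half to the chosen peer.
            for j, share in ((i, 0.5), (target, 0.5)):
                new_values[j] += values[i] * share
                new_weights[j] += weights[i] * share
        values, weights = new_values, new_weights
    # value/weight approximates the global average; multiply by n for the sum.
    return [v / w * n for v, w in zip(values, weights)]


if __name__ == "__main__":
    # Four nodes, each holding a partial count for one (variable, parent-config, value) cell.
    partial_counts = [120, 80, 95, 105]
    print(push_sum(partial_counts))  # each node's estimate of the exact total, 400
```

Because sums and weights are conserved across rounds, every node's ratio converges to the global average, so each node can locally reconstruct the total count needed by a decomposable score term; the sketch simulates all nodes in one process purely for illustration.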
