A MapReduce-Based Method for Learning Bayesian Network from Massive Data

The Bayesian network (BN) is a popular and important probabilistic graphical model for representing and inferring uncertain knowledge. Learning a BN from massive data is the basis for uncertain-knowledge-centered inference, prediction, and decision making. The nature of massive data requires that BN learning be adapted to large data volumes and executed in parallel. In this paper, we propose a MapReduce-based approach for learning a BN from massive data by extending the traditional scoring-and-search algorithm. First, in the scoring process, we develop map and reduce algorithms for obtaining the required parameters in parallel. Second, in the search process, we develop map and reduce algorithms that, for each node, score all candidate local structures in parallel and select the locally optimal structure with the highest score. The locally optimal structures of all nodes are then merged into the globally optimal one. Experimental results indicate that the proposed method is effective and efficient.
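The two phases described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the function names (`map_counts`, `reduce_counts`, `k2_log_score`, `best_parents`) are hypothetical, the K2 (Cooper-Herskovits) metric is assumed as the scoring function because the abstract does not name one, and plain Python iteration stands in for a real MapReduce runtime such as Hadoop.

```python
from collections import defaultdict
from itertools import combinations
from math import lgamma


def map_counts(record, node, parents):
    """Map step: emit ((parent configuration, node value), 1) for one record."""
    key = (tuple(record[p] for p in parents), record[node])
    yield key, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def k2_log_score(counts, arity):
    """Log K2 score of one local structure (assumed metric, see lead-in).

    counts maps (parent configuration, node value) -> frequency;
    arity is the number of values the node can take.
    """
    by_config = defaultdict(list)
    for (config, _value), n in counts.items():
        by_config[config].append(n)
    score = 0.0
    for ns in by_config.values():
        # log[(r - 1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += lgamma(arity) - lgamma(sum(ns) + arity)
        score += sum(lgamma(n + 1) for n in ns)
    return score


def best_parents(data, node, candidates, arity, max_parents=2):
    """Score every candidate parent set of `node` and keep the best one.

    Each candidate is independent of the others, so in a cluster setting
    each could be scored by its own map/reduce job.
    """
    best = None
    for size in range(max_parents + 1):
        for parents in combinations(candidates, size):
            pairs = (kv for rec in data for kv in map_counts(rec, node, parents))
            score = k2_log_score(reduce_counts(pairs), arity)
            if best is None or score > best[0]:
                best = (score, parents)
    return best
```

Calling `best_parents` once per node yields the local structures to be merged. The sketch does not check acyclicity; one common way to guarantee it, assumed here rather than taken from the abstract, is to restrict each node's candidates to its predecessors in a fixed node ordering.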
