A Parallel and Incremental Approach for Data-Intensive Learning of Bayesian Networks

The Bayesian network (BN) is a widely adopted model for representing and reasoning about uncertain knowledge. As the basis of realistic applications centered on probabilistic inference, learning a BN from data is a central subject of machine learning, artificial intelligence, and big-data paradigms. Classical BN-learning methods must now be extended to data-intensive computing and cloud environments. In this paper, we propose a parallel and incremental approach for data-intensive learning of BNs from massive, distributed, and dynamically changing data, extending the classical scoring-and-search algorithm by means of MapReduce. First, we adopt the minimum description length (MDL) as the scoring metric and give two-pass MapReduce-based algorithms for computing the required marginal probabilities and scoring a candidate graphical model from sample data. We then give the corresponding strategy for extending the classical hill-climbing algorithm to obtain the optimal structure, as well as a scheme for storing a BN as ⟨key, value⟩ pairs. Further, in view of the dynamic nature of the changing data, we introduce the concept of influence degree to measure how well the current BN fits newly arriving data, and propose corresponding two-pass MapReduce-based algorithms for incremental BN learning. Experimental results show the efficiency, scalability, and effectiveness of our methods.
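The abstract describes a counting pass that aggregates family statistics as ⟨key, value⟩ pairs, followed by MDL scoring of a candidate structure. As a rough single-machine sketch of that idea (the function names, the simulated map/reduce phases, and the sign convention for the MDL score are our own illustration, not the paper's algorithms), the two phases might look like:

```python
import math
from collections import defaultdict

def map_counts(record, families):
    # Map phase: for each family (child, parents), emit one
    # <key, value> pair keyed by the child's value and the
    # parent configuration observed in this record.
    for child, parents in families:
        key = (child, record[child], tuple(record[p] for p in parents))
        yield key, 1

def reduce_counts(pairs):
    # Reduce phase: sum the counts for each distinct key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

def mdl_score(data, families, arity):
    # Pass 1: aggregate family counts via the simulated map/reduce.
    pairs = (kv for rec in data for kv in map_counts(rec, families))
    counts = reduce_counts(pairs)
    n = len(data)
    # Pass 2: totals per (child, parent-config) give the conditionals.
    parent_totals = defaultdict(int)
    for (child, _val, pcfg), c in counts.items():
        parent_totals[(child, pcfg)] += c
    loglik = sum(c * math.log(c / parent_totals[(child, pcfg)])
                 for (child, _val, pcfg), c in counts.items())
    # MDL penalty: (log n / 2) times the number of free parameters.
    params = sum((arity[child] - 1) * math.prod(arity[p] for p in parents)
                 for child, parents in families)
    return loglik - 0.5 * math.log(n) * params
```

A hill-climbing search in the spirit of the abstract would call `mdl_score` on each neighbor of the current graph (one edge added, removed, or reversed) and keep the highest-scoring acyclic candidate; in the paper's setting each such evaluation is a MapReduce job rather than an in-memory loop.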
