A New Method for Vertical Parallelisation of TAN Learning Based on Balanced Incomplete Block Designs

The framework of Bayesian networks is a widely used formalism for performing belief update under uncertainty. Structure-restricted Bayesian network models such as the Naive Bayes model and the Tree-Augmented Naive Bayes (TAN) model have shown impressive performance on classification tasks. However, if the number of variables or the amount of data is large, then learning a TAN model from data can be a time-consuming task. In this paper, we introduce a new method for parallel learning of a TAN model from large data sets. The method computes the mutual information scores between pairs of variables given the class variable in parallel, with the computations organised using balanced incomplete block designs. The results of a preliminary empirical evaluation of the proposed method on large data sets show that a significant performance improvement is possible through parallelisation using the method presented in this paper.
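The two ingredients named above can be illustrated concretely. TAN learning scores every pair of attributes by their mutual information conditional on the class, and a balanced incomplete block design (BIBD) partitions the variables into blocks so that every pair of variables falls in exactly λ blocks, letting each worker score the pairs of its own block without duplicated work. The sketch below (a minimal illustration, not the paper's implementation; function and variable names are illustrative) shows a count-based estimate of the conditional mutual information I(X; Y | C) and the classical (7, 3, 1)-BIBD, the Fano plane, whose 7 blocks of 3 variables cover each of the 21 variable pairs exactly once:

```python
import math
from collections import Counter
from itertools import combinations

def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) from three aligned sequences of discrete values,
    using the plug-in (empirical count) estimator:
      I(X;Y|C) = sum_{x,y,c} p(x,y,c) * log( p(x,y,c) p(c) / (p(x,c) p(y,c)) ).
    """
    n = len(xs)
    joint = Counter(zip(xs, ys, cs))  # counts of (x, y, c)
    xc = Counter(zip(xs, cs))         # counts of (x, c)
    yc = Counter(zip(ys, cs))         # counts of (y, c)
    c = Counter(cs)                   # counts of c
    cmi = 0.0
    for (x, y, z), n_xyz in joint.items():
        # n_xyz * n_c / (n_xc * n_yc) equals the probability ratio above.
        cmi += (n_xyz / n) * math.log((n_xyz * c[z]) / (xc[(x, z)] * yc[(y, z)]))
    return cmi

# The Fano plane: the unique (7, 3, 1)-BIBD on 7 points. Each block of 3
# variables yields C(3,2) = 3 pairs; across the 7 blocks, every one of the
# C(7,2) = 21 variable pairs appears exactly once (lambda = 1), so 7 workers
# can score all pairwise CMI values with no overlap.
FANO_BLOCKS = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
               (1, 4, 6), (2, 3, 6), (2, 4, 5)]

def pairs_covered(blocks):
    """List every variable pair scored by some block."""
    return [p for b in blocks for p in combinations(sorted(b), 2)]
```

For example, with perfectly correlated sequences and a constant class, `conditional_mutual_information([0,0,1,1], [0,0,1,1], [0,0,0,0])` returns log 2, while for independent sequences it returns 0; and `pairs_covered(FANO_BLOCKS)` contains each of the 21 pairs of 7 variables exactly once, which is the property the paper's BIBD-based scheduling exploits.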
