Scalable Learning of k-dependence Bayesian Classifiers under MapReduce

In Data Mining there is a constant need for more scalable tools to tackle new domains of increasing complexity. Over the last few years, one of the main challenges in this field has been the growing size of the available data, driven by the data generation and storage capacities of emerging technologies. In response, a range of new computational paradigms and parallel architectures has been proposed. MapReduce has taken a leading role in Big Data applications since its appearance, and many popular data analysis tools and techniques have been successfully adapted to this paradigm. Supervised classification is one of the most common problems in Data Mining, and Bayesian Network Classifiers (BNCs) have become one of the most widespread and competitive techniques for approaching it. In this paper we propose a parallel definition of the k-dependence Bayesian classifier (KDB) algorithm under the MapReduce framework. We focus on obtaining maximum scalability and flexibility by exploring the concepts of vertical and horizontal parallelism, thus addressing both Big Data and high-dimensional problems simultaneously. We analyse the properties of the proposal and the advantages of applying it to large datasets of different kinds. Finally, we evaluate it experimentally by testing a Hadoop implementation on a high-end computer cluster.
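
To make the idea of horizontal parallelism concrete, the following is a minimal single-process sketch and not the implementation described in the paper. It assumes that horizontal parallelism means partitioning the instances across mappers, and it covers only the first step of KDB learning: counting (attribute, value, class) co-occurrences per data split, merging the counts, and ranking attributes by their mutual information with the class. All function names, the key layout, and the toy data are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log

def map_counts(split):
    """Map step (assumed): local counts for one horizontal data split."""
    counts = Counter()
    for *features, label in split:
        counts[("class", label)] += 1           # class frequency
        for i, value in enumerate(features):
            counts[(i, value, label)] += 1      # attribute/value/class frequency
    return counts

def reduce_counts(a, b):
    """Reduce step (assumed): merge two partial count tables."""
    return a + b

def mutual_information(counts, attr, n):
    """I(X_attr; C) in nats, computed from the merged counts."""
    joint = {(k[1], k[2]): v for k, v in counts.items()
             if len(k) == 3 and k[0] == attr}
    p_x = Counter()
    for (x, _), v in joint.items():
        p_x[x] += v / n
    mi = 0.0
    for (x, c), v in joint.items():
        p_xc = v / n
        p_c = counts[("class", c)] / n
        mi += p_xc * log(p_xc / (p_x[x] * p_c))
    return mi

# Toy data: two splits, two discrete attributes, class label last.
splits = [
    [(0, 0, "a"), (1, 1, "b")],
    [(0, 1, "a"), (1, 0, "b")],
]
total = reduce(reduce_counts, map(map_counts, splits))
n = sum(len(s) for s in splits)
mi = [mutual_information(total, i, n) for i in range(2)]
ranking = sorted(range(2), key=lambda i: -mi[i])  # KDB visits attributes in this order
```

In the toy data the first attribute determines the class and the second is independent of it, so the first is ranked ahead. A full KDB learner would also need the conditional mutual information between attribute pairs given the class in order to select up to k parents per attribute; that pairwise step is omitted here.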
