Improving Fault Diagnosis Performance Using Hadoop MapReduce for Efficient Classification and Analysis of Large Data Sets

Virtualization technology underpins much of the large volume of data handled in utility clouds; it is a key element of the utility cloud and an area in which monitoring is especially challenging. Monitoring large, complex systems requires accurate, low-latency, near-real-time fault detection and anomaly analysis, together with optimization and corrective actions. In this paper, we investigate a fine-grained fault-tolerance mechanism with newly proposed algorithms for analyzing large data sets on the Hadoop MapReduce platform. We implement a Naïve Bayes Classifier (NBC) on Hadoop MapReduce to achieve high-performance, efficient classification for the analysis procedure in virtualized utility-cloud environments. Evaluation results show that the accuracy of the proposed method approaches 89.80% as the size of the data sets increases. We demonstrate that the model scales to large data sets of virtual machine (VM) component utilization metrics while providing increased accuracy, low latency, and machine learning ability.
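As a rough illustration of how Naïve Bayes training can be phrased as map and reduce steps, the following minimal sketch counts class and feature occurrences in a map phase, sums them in a reduce phase, and classifies with Laplace-smoothed log probabilities. It is not the implementation evaluated in this paper: it runs in a single Python process rather than on Hadoop, and the function names, the discretized metrics ("cpu", "mem"), the labels ("fault", "normal"), and the sample records are all hypothetical.

```python
from collections import defaultdict
from math import log


def mapper(record):
    """Map step: emit (key, 1) pairs for one labeled sample."""
    features, label = record
    yield ("class", label), 1
    for name, value in features.items():
        yield ("feat", label, name, value), 1


def reducer(pairs):
    """Reduce step: sum the counts per key, as would happen after the shuffle."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def train(records):
    """Run the map and reduce steps in-process (a stand-in for a Hadoop job)."""
    return reducer(pair for record in records for pair in mapper(record))


def classify(counts, features, labels, values_per_feature):
    """Return the label with the highest Laplace-smoothed log posterior."""
    n = sum(counts.get(("class", c), 0) for c in labels)
    best_label, best_score = None, float("-inf")
    for c in labels:
        class_count = counts.get(("class", c), 0)
        score = log((class_count + 1) / (n + len(labels)))
        for name, value in features.items():
            seen = counts.get(("feat", c, name, value), 0)
            score += log((seen + 1) / (class_count + values_per_feature[name]))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical, already-discretized VM utilization samples.
records = [
    ({"cpu": "high", "mem": "high"}, "fault"),
    ({"cpu": "high", "mem": "low"}, "fault"),
    ({"cpu": "low", "mem": "low"}, "normal"),
    ({"cpu": "low", "mem": "high"}, "normal"),
    ({"cpu": "low", "mem": "low"}, "normal"),
]
labels = ["fault", "normal"]
values_per_feature = {"cpu": 2, "mem": 2}  # number of distinct values per feature

counts = train(records)
prediction = classify(counts, {"cpu": "high", "mem": "high"}, labels, values_per_feature)
print(prediction)
```

Because the per-key counts are simple sums, the map step can run independently on each data split and the reduce step can merge partial counts in any order, which is what makes this kind of classifier a natural fit for a MapReduce-style job.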
