PWLM3-based automatic performance model estimation method for HDFS write and read operations

Automatic performance model estimation for Hadoop Distributed File System (HDFS) write and read (W/R) operations is increasingly needed to cope with continual software updates, parameter configuration changes, hardware heterogeneity, and Quality of Service (QoS) evaluation. Existing research based on a single linear system model has limited ability to explain the performance variations caused by changes in HDFS parameters such as block size. These variations exhibit typical characteristics of nonlinear systems and are an obstacle to effective automatic performance estimation. To address this challenge, a piecewise-linear multi-model modeling (PWLM3)-based automatic performance model estimation method is proposed for HDFS W/R performance. In the proposed method, a standard model base is built to standardize the representation of every submodel. A cluster quality assessment strategy is applied to determine the optimal number of submodels, and a submodel selection strategy is used to construct the performance model candidates and improve the computational efficiency of the method. In addition, the Levenberg-Marquardt (LM) and Universal Global Optimization (UGO) algorithms are adopted to estimate the switch points and identify the undetermined parameters of the performance model candidates. The performance model is then selected from these candidates according to the Root Mean Squared Error (RMSE) indicator. Experimental results demonstrate that the PWLM3-based performance model describes the nonlinear characteristics of HDFS W/R performance well and achieves better identification precision than a model based on a single linear system.
A PWLM3-based automatic performance model estimation method is proposed.
A model base is built to standardize the model representation of submodels.
A cluster quality assessment strategy is used to evaluate the number of submodels.
A submodel selection strategy is applied to build the performance model candidate set.
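
The idea of selecting between a single linear model and a piecewise-linear model by RMSE can be illustrated with a minimal sketch. This is not the authors' implementation: an exhaustive search over candidate switch points with ordinary least squares stands in for the LM and UGO algorithms, only two submodels are considered, and the data are synthetic. All function names, variable names, and numeric values are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate segment: fall back to a constant model
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def rmse(ys, preds):
    """Root mean squared error between observations and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

def fit_single(xs, ys):
    """Single linear model; returns (rmse, (a, b))."""
    a, b = fit_line(xs, ys)
    return rmse(ys, [a * x + b for x in xs]), (a, b)

def fit_two_piece(xs, ys, min_pts=3):
    """Two linear submodels separated by one switch point.

    Every split that leaves at least min_pts samples per side is tried,
    and the split with the lowest overall RMSE is kept.
    Returns (rmse, switch_x, left_params, right_params).
    """
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_pts, len(xs) - min_pts + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        preds = [left[0] * x + left[1] for x in xs[:k]]
        preds += [right[0] * x + right[1] for x in xs[k:]]
        err = rmse(ys, preds)
        if best is None or err < best[0]:
            best = (err, xs[k], left, right)
    return best

# Synthetic, noise-free data (hypothetical): the slope changes at x = 10.
xs = list(range(1, 21))
ys = [2.0 * x + 1.0 if x <= 10 else 0.5 * x + 16.0 for x in xs]

single_rmse, single_params = fit_single(xs, ys)
piece_rmse, switch_x, left_params, right_params = fit_two_piece(xs, ys)

# Candidate selection by the RMSE indicator.
selected = "piecewise" if piece_rmse < single_rmse else "single"
print(selected, switch_x, single_rmse, piece_rmse)
```

On data with a genuine slope change, the piecewise candidate yields the lower RMSE and is selected; on data that is linear throughout, the two candidates would score almost identically, which is why the number of submodels needs a separate assessment step.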
