Distance-aware virtual cluster performance optimization: A Hadoop case study

Cloud computing and big data are two major trends in information technology. However, data-intensive computing faces challenges on virtual machines in the cloud because of competition for virtualized resources and complex network communication. The network is one of the most notorious bottlenecks, which motivates strategies that lower communication and transmission costs in a virtual cluster. In this paper, we present a novel cluster performance optimization strategy named vClusterOpt. vClusterOpt finds centralized subgraphs of the node graph and chooses the node with the shortest logical distance as the kernel node of each subgraph, reducing inter-machine communication and transmission costs in the virtual cluster. To calculate logical distance accurately, we define two kinds of logical distance: Logical Communication Distance (LCD) and Logical Transmission Distance (LTD). The VM with the shortest LCD to the others is used as the communication kernel node, which bears the heaviest communication load, while the VM with the shortest LTD is treated as the transmission kernel node, which bears the heaviest data transmission load. We use benchmarks running on Hadoop as representative data-intensive computing services to demonstrate the effectiveness of our approach. Experiments show that our distance-aware virtual cluster optimization strategy achieves an average performance improvement of 20%.
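
The kernel-node selection described above can be illustrated with a minimal sketch. The abstract does not give the formulas for LCD or LTD, so the sketch assumes that each is available as a symmetric pairwise distance table and that "shortest distance to the others" means the smallest total distance to the remaining VMs in the subgraph. The function name `pick_kernel_node`, the VM names, and the distance values are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
from typing import Dict, List

# Hypothetical pairwise distance table: distance[a][b] is the logical
# distance between VM a and VM b (assumed symmetric, not from the paper).
DistanceTable = Dict[str, Dict[str, float]]


def pick_kernel_node(distance: DistanceTable, members: List[str]) -> str:
    """Return the member with the smallest total distance to the other members.

    Assumption: "shortest logical distance with others" is read as the
    minimum sum of distances; the paper may define it differently.
    """
    def total_distance(vm: str) -> float:
        return sum(distance[vm][other] for other in members if other != vm)

    return min(members, key=total_distance)


# Illustrative values only: three VMs in one subgraph.
members = ["vm1", "vm2", "vm3"]

lcd: DistanceTable = {
    "vm1": {"vm2": 1.0, "vm3": 4.0},
    "vm2": {"vm1": 1.0, "vm3": 2.0},
    "vm3": {"vm1": 4.0, "vm2": 2.0},
}
ltd: DistanceTable = {
    "vm1": {"vm2": 5.0, "vm3": 1.0},
    "vm2": {"vm1": 5.0, "vm3": 6.0},
    "vm3": {"vm1": 1.0, "vm2": 6.0},
}

# The same subgraph can have different kernels for communication and transmission.
comm_kernel = pick_kernel_node(lcd, members)
trans_kernel = pick_kernel_node(ltd, members)
print("communication kernel:", comm_kernel)
print("transmission kernel:", trans_kernel)
```

With these made-up values the two distance tables select different VMs, which matches the abstract's point that the communication kernel and the transmission kernel are chosen separately.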
