Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization

In the era of big data, Hive has quickly gained popularity for its ability to manage and analyze very large datasets, both structured and unstructured, residing in distributed storage systems. However, this opportunity comes with significant challenges: Hive query performance is affected by many factors, which makes capacity planning and tuning for a Hive cluster extremely difficult. These factors include the system software stack (Hive, the MapReduce framework, the JVM and the OS), the cluster hardware configuration (processor, memory, storage, and network), and Hive data models and distributions. Current planning methods rely mostly on trial and error or on very high-level estimation. These approaches are neither efficient nor accurate, especially given increasing software stack complexity, hardware diversity, and the unavoidable data skew in distributed database systems. In this paper, we propose a Hive simulation framework based on CSMethod, which simulates the whole Hive query execution life cycle, including query plan generation and MapReduce task execution. The framework is validated using typical query operations under varying hardware, software and workload parameters, and shows high accuracy and fast simulation speed. We also demonstrate the application of this framework with two real-world use cases: helping customers perform capacity planning, and estimating business query response time before system provisioning.
