Automated and Portable Hadoop Cluster Orchestration on Clouds with Occopus for Big Data Applications

Apache Hadoop [1] is an open-source software framework for storing data in a distributed cluster environment and for running applications that process large amounts of data quickly and efficiently. In the last few years, Hadoop has become a very popular system for analyzing Big Data through its implementation of the MapReduce [2] programming model, which Google introduced in 2004. Many scientific applications, such as weather forecasting [3], DNA sequencing [4], and molecular dynamics [5], have been parallelized using Hadoop. However, deploying a fully functional Hadoop cluster is not a trivial task. It typically requires skills that data scientists do not have, which remains a significant barrier to the spread of this technology among them. Combining Hadoop, the cloud, and an orchestration tool for dynamically building Hadoop clusters would help these scientists run their Big Data applications. Complex virtual infrastructures like Hadoop, with all of their configuration and network design, need special planning, care, and skills from end users to yield a properly functioning cluster. One of our main target user groups is the Hungarian academic research community with its new computing infrastructure, the MTA Cloud. This paper focuses on utilizing a hybrid cloud orchestration tool called Occopus [6]. The solution presented in this paper provides automatic deployment of a fully functional Hadoop cluster without requiring a low-level understanding of the Hadoop architecture or of cloud computing. Moreover, (1) it is portable, since the solution does not depend on any cloud-specific feature; (2) it is scalable, by utilizing the dynamic capabilities of Occopus and Hadoop; (3) it does not require any prepared image; (4) it lets advanced users fine-tune the configuration of the Hadoop components; and finally (5) it supports both short-term and long-term usage scenarios.