Paper Information - Automated and Portable Hadoop Cluster Orchestration on Clouds with Occopus for Big Data Applications


Apache Hadoop [1] is an open-source software framework for storing data in a distributed cluster environment and running applications that process large amounts of data quickly and efficiently. In the last few years, Hadoop has become a very popular system for analyzing Big Data with its MapReduce [2] framework, based on the programming model introduced by Google in 2004. Many scientific applications, such as weather forecasting [3], DNA sequencing [4], and molecular dynamics [5], have been parallelized using Hadoop. However, deploying a fully functional Hadoop cluster is not a trivial task. It currently exceeds the skills of many data scientists, which remains a significant barrier to the wider adoption of this technology among them. Combining Hadoop, the cloud, and an orchestration tool that dynamically builds Hadoop clusters would help these scientists run their Big Data applications. Complex virtual infrastructures like Hadoop, with all of their configuration and network design, need special planning, care, and skills from end users to yield a properly functioning cluster. One of our main target user groups is the Hungarian academic research community with its new computing infrastructure, the MTA Cloud. This paper focuses on utilizing a hybrid cloud orchestration tool called Occopus [6]. The solution presented in this paper provides automatic deployment of a fully functional Hadoop cluster without requiring a low-level understanding of the Hadoop architecture or cloud computing. Moreover, (1) it is portable, since the solution does not depend on any cloud-specific feature; (2) it is scalable, by utilizing the dynamic capabilities of Occopus and Hadoop; (3) it does not require any prepared image; (4) it lets advanced users fine-tune the configuration of the Hadoop components; and (5) it supports short- or long-term usage scenarios.

Róbert Lovas | József Kovács | Enikő Nagy

[1] Zhiqiang Ma, et al. Hadoop-based ARIMA Algorithm and its Application in Weather Forecast, 2013.

[2] Michael C. Schatz, et al. CloudBurst: highly sensitive read mapping with MapReduce, 2009, Bioinform.

[3] Péter Kacsuk, et al. One Click Cloud Orchestrator: Bringing Complex Applications Effortlessly to the Clouds, 2014, Euro-Par Workshops.

[4] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[5] Hong Tang, et al. Molecular dynamics simulation: Implementation and optimization based on Hadoop, 2012, 2012 8th International Conference on Natural Computation.