A System Architecture for Running Big Data Workflows in the Cloud

Scientific workflows have become an important paradigm for domain scientists to formalize and structure complex data-intensive scientific processes. The ever-increasing volumes of scientific data motivate researchers to extend scientific workflow management systems (SWFMSs) to utilize the power of cloud computing to perform big data analyses. Unlike workflows run in traditional on-premise environments such as stand-alone workstations or grids, cloud workflows rely on dynamically provisioned computing, storage and network resources that are terminated when no longer used. This dynamic and volatile nature of cloud resources, as well as other cloud-specific factors, introduces a new set of challenges for "cloud-enabled" SWFMSs. Although a few SWFMSs have been integrated with cloud infrastructures, providing some experience for future research and development, a comprehensive study from an architectural perspective is still missing. To this end, we conduct a hands-on study by running a big data workflow in Amazon EC2, FutureGrid Eucalyptus and OpenStack clouds. From this experience we 1) identify the key challenges for running big data workflows in the cloud, 2) propose a generic, implementation-independent system architecture that addresses these challenges, and 3) develop a cloud-enabled SWFMS called DATAVIEW that delivers a specific implementation of the proposed architecture. Finally, to validate our proposed architecture, we conduct a case study in which we design and run a big data workflow towards addressing an EB-scale big data analysis problem in the automotive industry domain.
