BDAP: A Big Data Placement Strategy for Cloud-Based Scientific Workflows

In the era of Big Data, there is a growing need for scientific workflows to perform computations at a scale far exceeding the capabilities of a single workstation. When such data-intensive workflows run in the cloud, distributed across several physical locations, both execution time and resource utilization efficiency depend heavily on the initial placement and distribution of the input datasets across the virtual machines in the cloud. In this paper, we propose BDAP (Big DAta Placement strategy), a strategy that improves workflow performance by minimizing data movement across virtual machines. Specifically, we 1) formalize the data placement problem in scientific workflows, 2) propose a data placement algorithm that considers both the initial input datasets and the intermediate datasets generated during workflow execution, and 3) perform extensive experiments in a distributed environment to verify that the proposed strategy provides an effective data placement solution that places big datasets on the appropriate virtual machines in the cloud within reasonable time.
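To make the objective of minimizing data movement concrete, the following is a minimal sketch of how the cross-VM data volume of a candidate placement might be scored. It is not the BDAP algorithm or the paper's formalization. The function name `data_movement`, the dictionary layout, and the rule that each task runs on the VM already holding most of its input volume are all assumptions made for illustration.

```python
from collections import defaultdict


def data_movement(tasks, sizes, placement):
    """Return the total data volume that must cross VM boundaries.

    tasks:     {task_name: [dataset_name, ...]}, the inputs each task reads
    sizes:     {dataset_name: size}, in arbitrary units (for example GB)
    placement: {dataset_name: vm_name}

    Assumption (illustrative only): each task executes on the VM that
    already holds the largest share of its input volume, so only the
    remaining inputs have to be transferred.
    """
    total = 0
    for inputs in tasks.values():
        # Sum the input volume of this task per hosting VM.
        per_vm = defaultdict(int)
        for dataset in inputs:
            per_vm[placement[dataset]] += sizes[dataset]
        # Everything not on the best VM has to be moved there.
        local = max(per_vm.values(), default=0)
        total += sum(per_vm.values()) - local
    return total


# Hypothetical workflow: two tasks sharing dataset "b".
tasks = {"t1": ["a", "b"], "t2": ["b", "c"]}
sizes = {"a": 10, "b": 5, "c": 20}

scattered = {"a": "vm1", "b": "vm2", "c": "vm1"}
colocated = {"a": "vm1", "b": "vm1", "c": "vm2"}

print(data_movement(tasks, sizes, scattered))  # 10
print(data_movement(tasks, sizes, colocated))  # 5
```

Under these assumptions, placing datasets "a" and "b" together halves the transferred volume, which is the kind of difference a placement strategy aims to exploit.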
