Workflow-Based High Performance Data Transfer and Ingestion to Support Petascale Simulations on TeraGrid

We report on the design of high-performance data transfer and ingestion carried out in a scientific workflow project to support Southern California Earthquake Center (SCEC) petascale simulations on TeraGrid (TG). The design uses grid resources to pipeline data pre- and post-processing within the simulation workflow. We develop an enhanced prototype framework that combines the Globus Toolkit with advanced MPI batch jobs for reliable and efficient data transfer between heterogeneous supercomputer clusters on TG. The framework automates the entire data transfer process, requires no human intervention, and recovers automatically from failures during transfers. We also examine optimization approaches for ingesting simulation data into the iRODS (Integrated Rule-Oriented Data System) digital library. The average transfer rate from TACC Ranger to iRODS reaches 133 MB/s, 5 times faster than conventional methods. Experiments on TG clusters demonstrate that these concurrent data transfer and ingestion mechanisms shorten the processing time of the scientific workflow and significantly reduce the load as well.
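The combination of concurrent transfers and automatic recovery described above can be illustrated with a minimal sketch. It is not the framework itself: the actual system relies on the Globus Toolkit and MPI batch jobs, whereas this example assumes a hypothetical `transfer` callable that moves one file and raises an exception on failure, and it uses a Python thread pool as a stand-in for the parallel workers. The names `transfer_with_retry`, `transfer_all`, `workers`, and `max_attempts` are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def transfer_with_retry(transfer, item, max_attempts=3, backoff_s=0.0):
    """Call transfer(item), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(item)  # hypothetical single-file transfer
            return attempt  # number of attempts that were needed
        except Exception:
            if attempt == max_attempts:
                raise  # give up and report the failure to the caller
            time.sleep(backoff_s * attempt)  # simple linear backoff


def transfer_all(transfer, items, workers=4, max_attempts=3):
    """Transfer items concurrently; return sorted (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(transfer_with_retry, transfer, item, max_attempts): item
            for item in items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()
                succeeded.append(item)
            except Exception:
                failed.append(item)  # retries exhausted for this item
    return sorted(succeeded), sorted(failed)
```

In this sketch a transient failure is absorbed by the retry loop, while an item that keeps failing is reported in the second list so that a workflow step could resubmit or log it without operator involvement.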
