Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

Modern High-Performance Computing (HPC) centers face a data deluge from emerging scientific applications. Supporting large data sets entails a significant commitment of the center's high-throughput storage system, the scratch space. However, scratch space is typically managed using simple “purge policies,” without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers, which cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that offloading times can be significantly reduced (by 90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.
