Design and Evaluation of a Simple Data Interface for Efficient Data Transfer across Diverse Storage

Modern science and engineering computing environments often feature storage systems of different types, from parallel file systems in high-performance computing centers to object stores operated by cloud providers. To enable easy, reliable, secure, and performant data exchange among these different systems, we propose Connector, a pluggable data access architecture for diverse, distributed storage. By hiding low-level storage system details, Connector permits a managed data transfer service (Globus, in our case) to interact with a large and easily extended set of storage systems. Equally important, it supports third-party transfers: direct data transfers from source to destination that are initiated by a third-party client but do not place that client in the data path. The abstraction also enables management of transfers for performance optimization, error handling, and end-to-end integrity. We present the Connector design, describe implementations for different storage services, evaluate tradeoffs inherent in managed vs. direct transfers, motivate recommended deployment options, and propose a model-based method that characterizes performance in different contexts without exhaustive benchmarking.
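
To make the idea of a pluggable storage abstraction concrete, the following is a minimal sketch in Python. It is not the Connector interface from the paper: the class names (`Connector`, `InMemoryConnector`), the `read`/`write` methods, and the `transfer` helper are all hypothetical, chosen only to illustrate how a transfer routine can be written once against an abstract interface and then verify end-to-end integrity with a checksum.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator


class Connector(ABC):
    """Hypothetical minimal storage interface (an assumption, not the paper's API)."""

    @abstractmethod
    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        """Yield the object's bytes in chunks of at most chunk_size."""

    @abstractmethod
    def write(self, path: str, chunks: Iterable[bytes]) -> None:
        """Store the concatenation of chunks at path."""


class InMemoryConnector(Connector):
    """Toy backend standing in for a file system or object store."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def read(self, path: str, chunk_size: int) -> Iterator[bytes]:
        data = self.objects[path]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    def write(self, path: str, chunks: Iterable[bytes]) -> None:
        self.objects[path] = b"".join(chunks)


def checksum(conn: Connector, path: str, chunk_size: int = 4) -> str:
    """SHA-256 of an object, computed through the connector interface only."""
    digest = hashlib.sha256()
    for chunk in conn.read(path, chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def transfer(src: Connector, src_path: str,
             dst: Connector, dst_path: str, chunk_size: int = 4) -> bool:
    """Stream an object from src to dst, then compare checksums on both sides."""
    dst.write(dst_path, src.read(src_path, chunk_size))
    return checksum(src, src_path) == checksum(dst, dst_path)


source, destination = InMemoryConnector(), InMemoryConnector()
source.objects["results.dat"] = b"simulation output"
verified = transfer(source, "results.dat", destination, "copy.dat")
```

Because `transfer` depends only on the abstract interface, supporting a new storage system in this sketch means adding one subclass. In a real third-party transfer the bytes would flow directly between the two storage endpoints rather than through the process that calls `transfer`; the in-process loop here only simulates that data path.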
