Tiered data management system: Accelerating data processing on HPC systems

Abstract The explosion of scientific data generated by large-scale simulations and advanced sensors makes scientific workflows more complex and more data-intensive. Supporting these data-intensive workflows on high-performance computing (HPC) systems presents new data management challenges due to their scale, coordination behaviour, and overall complexity. In this paper, we propose the Tiered Data Management System (TDMS) to accelerate scientific workflows on HPC systems. TDMS avoids repetitive data movement by providing efficient data sharing on top of a tiered storage architecture. Customized data management for common workflow access patterns allows users to fully exploit the strengths of different storage tiers. An extended application interface, which supports user-defined data management strategies, strengthens its ability to handle diverse storage architectures and application scenarios. Moreover, we propose a data-aware task scheduling module that launches tasks on the compute nodes where the locality of the required data can be exploited most effectively. We build a prototype and deploy it on a typical HPC system. We evaluate TDMS with realistic workflows, and the experiments show that TDMS improves I/O performance and provides up to 1.54x speedup for data-intensive workflows compared with the Lustre file system.
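To make the data-aware scheduling idea in the abstract concrete, the following minimal Python sketch (all names hypothetical, not TDMS's actual interface) scores each compute node by how many of a task's input bytes already reside on its local storage tier and launches the task on the best-scoring node, minimizing traffic to the shared parallel file system.

```python
# Hypothetical sketch of data-aware task scheduling: pick the compute node
# whose local storage tier already holds the largest share of a task's inputs.
# Node, Task, and schedule() are illustrative names, not part of TDMS.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    # Bytes of each dataset cached on this node's local tier (e.g. node-local NVMe).
    local_data: dict = field(default_factory=dict)

@dataclass
class Task:
    name: str
    # Required input datasets and their sizes in bytes.
    inputs: dict = field(default_factory=dict)

def schedule(task: Task, nodes: list) -> Node:
    """Return the node where the most input bytes are already local,
    so the bytes fetched from the shared file system are minimized."""
    def local_bytes(node: Node) -> int:
        return sum(min(node.local_data.get(name, 0), size)
                   for name, size in task.inputs.items())
    return max(nodes, key=local_bytes)

# Example: task T needs datasets A (4 GiB) and B (1 GiB); node n1 caches A locally.
nodes = [Node("n0"), Node("n1", {"A": 4 << 30})]
task = Task("T", {"A": 4 << 30, "B": 1 << 30})
print(schedule(task, nodes).name)  # -> n1
```

A real scheduler would also weigh node load and tier bandwidth, but the core heuristic, preferring nodes that already hold the task's inputs, is what the data-aware scheduling module described above relies on.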
