Enhance parallel input/output with cross-bundle aggregation

The exponential growth of computing power on leadership-scale computing platforms poses a grand challenge to the input/output (I/O) performance of scientific applications. To bridge the performance gap between computation and I/O, various parallel I/O libraries have been developed and adopted by computational scientists. These libraries enhance I/O parallelism by allowing multiple processes to concurrently access a shared data set. They also integrate I/O optimization strategies, such as data sieving and two-phase I/O, to better exploit the bandwidth supplied by the underlying parallel file system. Most of these techniques are optimized for access to a single bundle of variables, generated by the application during an I/O phase and stored as a single file; few focus on cross-bundle I/O optimization. In this article, we investigate the potential benefit of cross-bundle I/O aggregation. Based on an analysis of the I/O patterns of a mission-critical scientific application, the Goddard Earth Observing System, version 5 (GEOS-5), we propose a Bundle-based PARallel Aggregation (BPAR) framework with three partitioning schemes to improve its I/O performance, as well as that of a broad range of other scientific applications. Our experimental results reveal that BPAR delivers a 2.1× I/O performance improvement over the baseline GEOS-5 and shows promise for accelerating the I/O performance of scientific applications on various computing platforms.
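The abstract describes cross-bundle aggregation only at a high level, so the following is a minimal, illustrative MPI sketch of the general idea: each variable bundle is assigned to a different aggregator process, each aggregator gathers its bundle from all ranks, and each aggregator writes its bundle to a separate file so the per-bundle writes can proceed in parallel. The bundle count, block size, round-robin aggregator assignment, and file names are assumptions made for illustration; this is not the actual BPAR partitioning logic or the GEOS-5 I/O path.

```c
/*
 * Minimal sketch of bundle-based aggregation (illustrative assumptions only):
 * every rank holds one fixed-size block of each of NBUNDLES variable bundles;
 * bundle b is gathered onto an assumed aggregator rank (b % nprocs), which
 * writes that bundle to its own file.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUNDLES 3      /* assumed number of variable bundles        */
#define BLOCK    1024   /* assumed per-rank block size (in doubles)  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns one block of every bundle (filled with dummy data). */
    double *bundles[NBUNDLES];
    for (int b = 0; b < NBUNDLES; b++) {
        bundles[b] = malloc(BLOCK * sizeof(double));
        for (int i = 0; i < BLOCK; i++)
            bundles[b][i] = rank + 0.001 * b;
    }

    for (int b = 0; b < NBUNDLES; b++) {
        /* Assumed round-robin assignment of bundles to aggregator ranks. */
        int root = b % nprocs;

        double *recvbuf = NULL;
        if (rank == root)
            recvbuf = malloc((size_t)nprocs * BLOCK * sizeof(double));

        /* Aggregate bundle b from every rank onto its aggregator. */
        MPI_Gather(bundles[b], BLOCK, MPI_DOUBLE,
                   recvbuf,    BLOCK, MPI_DOUBLE, root, MPI_COMM_WORLD);

        /* Each aggregator writes its bundle to a separate file, so writes
         * of different bundles land on different ranks and different files. */
        if (rank == root) {
            char fname[64];
            snprintf(fname, sizeof(fname), "bundle_%d.bin", b);
            FILE *fp = fopen(fname, "wb");
            if (fp) {
                fwrite(recvbuf, sizeof(double), (size_t)nprocs * BLOCK, fp);
                fclose(fp);
            }
            free(recvbuf);
        }
    }

    for (int b = 0; b < NBUNDLES; b++)
        free(bundles[b]);

    MPI_Finalize();
    return 0;
}
```

The point of the sketch is the decoupling: instead of all processes collectively writing every bundle into one shared file, each aggregator streams one bundle as a contiguous file, which is, loosely, the cross-bundle parallelism the article exploits.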
