I/O-aware bandwidth allocation for petascale computing systems

In the Big Data era, the gap between storage performance and applications' I/O requirements keeps widening. I/O congestion caused by concurrent storage accesses from multiple applications is inevitable and severely harms performance. Conventional approaches either optimize each application's access pattern individually or handle I/O requests at a low-level storage layer without any knowledge from the upper-level applications. In this paper, we present a novel I/O-aware bandwidth allocation framework to coordinate ongoing I/O requests on petascale computing systems. The motivation behind this design is that the resource management system has a holistic view of both the system state and jobs' activities, and can dynamically control jobs' status or allocate resources on the fly during their execution. We treat a job's I/O requests as periodic sub-jobs within its lifecycle, thereby transforming the I/O congestion issue into a classical scheduling problem. Based on this model, we propose a bandwidth management mechanism as an extension to the existing scheduling system. We design several bandwidth allocation policies with different optimization objectives, targeting either user-oriented metrics or system performance. We conduct extensive trace-based simulations using real job traces and I/O traces from a production IBM Blue Gene/Q system at Argonne National Laboratory. Experimental results demonstrate that our new design can improve job performance by more than 30%, while also increasing system performance.
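To make the idea concrete, the sketch below illustrates the kind of bandwidth allocation policies the abstract describes: jobs issue periodic I/O sub-jobs with a bandwidth demand, and when aggregate demand exceeds the system's storage bandwidth, a policy decides each job's share. This is a minimal illustration, not the paper's actual algorithm; all names, the bandwidth figure, and the two policies (proportional fair share as a system-oriented baseline, shortest-remaining-I/O-first as a user-oriented one) are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class IORequest:
    """One periodic I/O sub-job issued by a running job (illustrative model)."""
    job_id: str
    demand: float     # requested bandwidth (GB/s)
    remaining: float  # remaining I/O volume (GB), used by the SJF-style policy

SYSTEM_BANDWIDTH = 90.0  # total storage bandwidth (GB/s); hypothetical value

def allocate_fair_share(requests):
    """System-oriented baseline: scale every demand by the same factor
    so the total never exceeds SYSTEM_BANDWIDTH."""
    total = sum(r.demand for r in requests)
    scale = min(1.0, SYSTEM_BANDWIDTH / total) if total else 1.0
    return {r.job_id: r.demand * scale for r in requests}

def allocate_shortest_first(requests):
    """User-oriented policy: fully satisfy jobs with the least remaining
    I/O first, so short I/O phases finish quickly."""
    alloc, left = {}, SYSTEM_BANDWIDTH
    for r in sorted(requests, key=lambda r: r.remaining):
        grant = min(r.demand, left)
        alloc[r.job_id] = grant
        left -= grant
    return alloc

# Example: three jobs each demanding 40 GB/s (120 total > 90 available).
reqs = [IORequest("A", 40.0, 10.0),
        IORequest("B", 40.0, 100.0),
        IORequest("C", 40.0, 5.0)]
print(allocate_fair_share(reqs))      # every job throttled to 30 GB/s
print(allocate_shortest_first(reqs))  # C and A run at full speed, B is throttled
```

The contrast between the two policies mirrors the trade-off the paper explores: proportional sharing keeps the storage system fully and fairly utilized, while ordering by remaining I/O volume improves per-job turnaround at the cost of delaying I/O-heavy jobs.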
