I/O-aware bandwidth allocation for petascale computing systems

In the Big Data era, the gap between storage performance and applications' I/O requirements keeps widening. I/O congestion caused by concurrent storage accesses from multiple applications is inevitable and severely harms performance. Conventional approaches either optimize each application's access pattern individually or handle I/O requests at a low-level storage layer without any knowledge from the upper-level applications. In this paper, we present a novel I/O-aware bandwidth allocation framework to coordinate ongoing I/O requests on petascale computing systems. The motivation behind this design is that the resource management system has a holistic view of both the system state and jobs' activities, and can dynamically control jobs' status or allocate resources on the fly during their execution. We treat a job's I/O requests as periodic sub-jobs within its lifecycle, thereby transforming the I/O congestion issue into a classical scheduling problem. Based on this model, we propose a bandwidth management mechanism as an extension to the existing scheduling system. We design several bandwidth allocation policies with different optimization objectives, targeting either user-oriented metrics or system performance. We conduct extensive trace-based simulations using real job traces and I/O traces from a production IBM Blue Gene/Q system at Argonne National Laboratory. Experimental results demonstrate that our new design can improve job performance by more than 30%, while also increasing system performance.
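To make the idea concrete, the sketch below illustrates the kind of bandwidth allocation policies the abstract describes: jobs issue periodic I/O sub-jobs with a bandwidth demand, and when aggregate demand exceeds the system's storage bandwidth, a policy decides each job's share. This is a minimal illustration, not the paper's actual algorithm; all names, the bandwidth figure, and the two policies (proportional fair share as a system-oriented baseline, shortest-remaining-I/O-first as a user-oriented one) are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class IORequest:
    """One periodic I/O sub-job issued by a running job (illustrative model)."""
    job_id: str
    demand: float     # requested bandwidth (GB/s)
    remaining: float  # remaining I/O volume (GB), used by the SJF-style policy

SYSTEM_BANDWIDTH = 90.0  # total storage bandwidth (GB/s); hypothetical value

def allocate_fair_share(requests):
    """System-oriented baseline: scale every demand by the same factor
    so the total never exceeds SYSTEM_BANDWIDTH."""
    total = sum(r.demand for r in requests)
    scale = min(1.0, SYSTEM_BANDWIDTH / total) if total else 1.0
    return {r.job_id: r.demand * scale for r in requests}

def allocate_shortest_first(requests):
    """User-oriented policy: fully satisfy jobs with the least remaining
    I/O first, so short I/O phases finish quickly."""
    alloc, left = {}, SYSTEM_BANDWIDTH
    for r in sorted(requests, key=lambda r: r.remaining):
        grant = min(r.demand, left)
        alloc[r.job_id] = grant
        left -= grant
    return alloc

# Example: three jobs each demanding 40 GB/s (120 total > 90 available).
reqs = [IORequest("A", 40.0, 10.0),
        IORequest("B", 40.0, 100.0),
        IORequest("C", 40.0, 5.0)]
print(allocate_fair_share(reqs))      # every job throttled to 30 GB/s
print(allocate_shortest_first(reqs))  # C and A run at full speed, B is throttled
```

The contrast between the two policies mirrors the trade-off the paper explores: proportional sharing keeps the storage system fully and fairly utilized, while ordering by remaining I/O volume improves per-job turnaround at the cost of delaying I/O-heavy jobs.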
