Paper Information - I/O-Aware Batch Scheduling for Petascale Computing Systems

I/O-Aware Batch Scheduling for Petascale Computing Systems

In the Big Data era, the gap between storage performance and applications' I/O requirements is widening. I/O congestion caused by concurrent storage accesses from multiple applications is inevitable and severely degrades performance. Conventional approaches either optimize each application's access pattern in isolation or handle I/O requests at a low-level storage layer without any knowledge of the upper-level applications. In this paper, we present a novel I/O-aware batch scheduling framework to coordinate ongoing I/O requests on petascale computing systems. The motivation is that the batch scheduler has a holistic view of both the system state and jobs' activities and can control job status on the fly during execution. We treat a job's I/O requests as periodic subjobs within its lifecycle and thereby transform the I/O congestion issue into a classical scheduling problem. We design two scheduling policies with different objectives, one targeting user-oriented metrics and the other targeting system performance. We conduct extensive trace-based simulations using real job traces and I/O traces from a production IBM Blue Gene/Q system. Experimental results demonstrate that our design can improve job performance by more than 30% while also increasing system performance.
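To make the idea of treating I/O requests as schedulable subjobs concrete, the following is a minimal sketch, not the paper's implementation. It assumes each pending I/O request carries a bandwidth demand and a data volume, and that the storage system has a fixed aggregate bandwidth. The names `IORequest`, `admit`, `fcfs`, and `shortest_first` are hypothetical, and the two ordering rules are illustrative stand-ins rather than the two policies the abstract refers to.

```python
from dataclasses import dataclass


@dataclass
class IORequest:
    job_id: str
    arrival: float    # time the job entered its I/O phase (seconds)
    bandwidth: float  # bandwidth the request would consume (GB/s), assumed known
    volume: float     # amount of data to transfer (GB), assumed known


def admit(requests, capacity, key):
    """Greedily admit requests in `key` order while storage bandwidth remains.

    Requests that would push aggregate demand above `capacity` are deferred,
    i.e. the owning job would be held until a later scheduling round.
    """
    running, deferred, used = [], [], 0.0
    for req in sorted(requests, key=key):
        if used + req.bandwidth <= capacity:
            running.append(req)
            used += req.bandwidth
        else:
            deferred.append(req)
    return running, deferred


def fcfs(req):
    # Illustrative ordering: earliest I/O request first.
    return req.arrival


def shortest_first(req):
    # Illustrative ordering: shortest estimated transfer time first.
    return req.volume / req.bandwidth


# Hypothetical workload: three jobs enter their I/O phases close together.
requests = [
    IORequest("A", arrival=0.0, bandwidth=60.0, volume=600.0),
    IORequest("B", arrival=1.0, bandwidth=50.0, volume=100.0),
    IORequest("C", arrival=2.0, bandwidth=40.0, volume=40.0),
]

for name, policy in [("fcfs", fcfs), ("shortest_first", shortest_first)]:
    running, deferred = admit(requests, capacity=100.0, key=policy)
    print(name, [r.job_id for r in running], [r.job_id for r in deferred])
```

With a capacity of 100 GB/s, the two orderings admit different sets of requests, which illustrates how the choice of scheduling objective changes which jobs proceed and which wait when storage is congested.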
