Client-side Straggler-Aware I/O Scheduler for Object-based Parallel File Systems

Object-based parallel file systems have emerged as promising storage solutions for high-performance computing (HPC) systems. Although object storage provides a flexible interface, scheduling highly concurrent I/O requests that access a large number of objects remains challenging, especially when stragglers (storage servers that are significantly slower than the others) exist in the system. An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. In this paper, we introduce a log-assisted, straggler-aware I/O scheduling technique to mitigate the impact of storage server stragglers. The contribution of this study is threefold. First, we introduce a client-side, log-assisted, straggler-aware I/O scheduler architecture to tackle the storage straggler issue in HPC systems. Second, we present three scheduling algorithms that, based on this architecture, make efficient scheduling decisions for I/Os while avoiding stragglers. Third, we evaluate the proposed I/O scheduler using simulations, and the simulation results confirm the promise of the newly introduced straggler-aware I/O scheduler.
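The core idea of a client-side, log-assisted scheduler can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the class, the sliding-window latency log, and the tie-breaking policy are all illustrative assumptions. The sketch keeps a per-server log of recently observed I/O latencies on the client and routes each request to the candidate server with the lowest estimated latency, thereby steering traffic away from stragglers:

```python
import random
from collections import defaultdict, deque


class StragglerAwareScheduler:
    """Hypothetical sketch of a client-side, log-assisted scheduler:
    route each object I/O to the candidate server with the lowest
    recent average latency, as recorded in a client-side log."""

    def __init__(self, window=32):
        # Per-server sliding window of recently observed latencies.
        self.log = defaultdict(lambda: deque(maxlen=window))

    def record(self, server, latency):
        # Client-side log update after each completed I/O.
        self.log[server].append(latency)

    def estimate(self, server):
        # Recent average latency; unseen servers are treated
        # optimistically so they still receive probe traffic.
        hist = self.log[server]
        return sum(hist) / len(hist) if hist else 0.0

    def pick(self, candidates):
        # Choose the candidate with the lowest estimated latency;
        # break ties randomly to spread load.
        best = min(self.estimate(s) for s in candidates)
        return random.choice(
            [s for s in candidates if self.estimate(s) == best]
        )


if __name__ == "__main__":
    sched = StragglerAwareScheduler()
    for lat in (1.0, 1.1):
        sched.record("s1", lat)
    for lat in (9.0, 8.5):
        sched.record("s2", lat)  # s2 behaves like a straggler
    print(sched.pick(["s1", "s2"]))  # prefers the faster server s1
```

A production scheduler would also need to age out stale log entries and handle replica placement constraints; the sketch only shows the straggler-avoidance decision itself.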
