Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance

It is hard for applications to fully utilize the peak bandwidth of the storage system in high-performance computers because of I/O interference, storage resource misallocation, and long, complex I/O paths. We performed several studies to bridge this gap in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate these issues and the connections between them, we developed an end-to-end performance monitoring and diagnosis tool to understand the I/O behaviors of applications and the system. With the help of this tool, we were able to find the root causes of such performance barriers at the I/O forwarding layer and the parallel file system (PFS) layer. An application-aware I/O forwarding allocation framework was used to address I/O interference and resource misallocation at the I/O forwarding layer. A performance-aware data placement mechanism was proposed to mitigate the impact of I/O interference and the performance variation of storage devices in the PFS. Together, these optimizations allowed applications to achieve much better I/O performance. During this process, we also proposed a lightweight storage stack to shorten the I/O path of applications with the N-N I/O pattern. This paper summarizes these studies and presents the lessons learned from the process.
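
To make the performance-aware data placement idea above concrete, here is a minimal sketch, not the paper's implementation: all names (StorageTarget, recent_bandwidth_mbps, choose_targets) and the weighting scheme are assumptions made for illustration. The sketch steers new file stripes away from slow or congested storage targets by weighting target selection with recently observed per-target bandwidth, which a monitoring tool could supply.

```python
# Hypothetical sketch of performance-aware data placement; names and the
# bandwidth-weighted selection policy are illustrative assumptions, not the
# Sunway storage system's actual mechanism.
import random
from dataclasses import dataclass


@dataclass
class StorageTarget:
    name: str
    recent_bandwidth_mbps: float  # e.g., a sliding-window average from monitoring


def choose_targets(targets, stripe_count, rng=random.Random(0)):
    """Pick stripe_count distinct targets, favoring higher observed bandwidth.

    A real placement policy would also consider capacity, load, and failure
    domains; weighting by recent bandwidth already avoids obvious stragglers.
    """
    pool = list(targets)
    weights = [t.recent_bandwidth_mbps for t in pool]
    chosen = []
    for _ in range(min(stripe_count, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen


if __name__ == "__main__":
    # One target is much slower (e.g., contended or fail-slow) and is rarely picked.
    osts = [StorageTarget(f"OST{i:03d}", bw)
            for i, bw in enumerate([900.0, 850.0, 120.0, 880.0])]
    print([t.name for t in choose_targets(osts, stripe_count=2)])
```

Compared with round-robin striping, weighting placement by measured bandwidth is what makes the policy "performance-aware": targets whose observed performance degrades automatically receive fewer new stripes until they recover.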
