Storage Support for Data-Intensive Applications on Large Scale High-Performance Computing Systems

Many believe that state-of-the-art yet decades-old high-performance computing (HPC) storage will not meet the I/O requirements of emerging exascale systems, mainly because compute and storage resources are segregated. Indeed, our simulation predicts, quantitatively, that efficiency and availability would approach zero as the system scale approaches exascale. This work proposes a new architecture with node-local persistent storage. Although collocating compute and storage has been widely leveraged in cloud computing, no such system has existed in HPC. We implement a system prototype, called FusionFS, with two major design principles: maximal metadata concurrency and optimal file write, both of which are crucial to HPC applications. We also discuss FusionFS's other integral features, such as hybrid and cooperative caching, efficient data access to compressed files, space-economic data redundancy, lightweight provenance tracking, and integration with data management systems.
