Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations

The state-of-the-art storage architecture of high-performance computing systems was designed decades ago, and at today's scale and level of concurrency it is showing significant limitations. Our recent work proposed a new architecture to address the I/O bottleneck of the conventional design, and the system prototype (FusionFS) demonstrated its effectiveness on up to 16 K nodes, a scale on par with today's largest supercomputers. The main objective of this paper is to investigate FusionFS's scalability towards exascale. Exascale computers are predicted to emerge by 2018, comprising millions of cores and billions of threads. We built an event-driven simulator (FusionSim) according to the FusionFS architecture and validated it against FusionFS's traces. FusionSim showed less than 4 percent error between its simulation results and the FusionFS traces. With FusionSim we simulated workloads on up to two million nodes and observed almost linear scalability of I/O performance; these results support FusionFS's viability for exascale systems. In addition to the simulation work, this paper extends the FusionFS system prototype in the following respects: (1) fault tolerance of file metadata is supported, (2) the limitations of the current system design are discussed, and (3) a more thorough performance evaluation is conducted, covering N-to-1 metadata write, system efficiency, and additional platforms such as Amazon Cloud.
