Could Blobs Fuel Storage-Based Convergence Between HPC and Big Data?

The ever-growing data sets processed on HPC platforms raise major challenges for the underlying storage layer. A promising alternative to POSIX-IO-compliant file systems is simpler blob (binary large object) or object storage systems. These offer lower overhead and better performance at the cost of dropping largely unused features such as file hierarchies or permissions. Similarly, blobs are increasingly considered as a replacement for distributed file systems in big data analytics, or as a base for storage abstractions such as key-value stores or time-series databases. This growing interest in object storage on both HPC and big data platforms raises the question: are blobs the right level of abstraction to enable storage-based convergence between HPC and big data? In this paper, we take a first step towards answering this question by analyzing the applicability of blobs for both platforms.
