Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms

The growing interest in being able to apply Big Data techniques to scientific data generated using HPC simulations led to the question of whether this is achievable on the same HPC platform, and if so, what is the performance that can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large, and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as case-study a Hadoop-based application that analyzes Molecular Dynamics simulations data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that one has to deal with when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data in Hadoop, optimizing the deployment to overcome the limitations of the HPC environment. Our work shows in a first phase that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance by efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.

[1]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[2]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[3]  Dhabaleswar K. Panda,et al.  Accelerating Spark with RDMA for Big Data Processing: Early Experiences , 2014, 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects.

[4]  Zhao Zhang,et al.  Scientific computing meets big data technology: An astronomy use case , 2015, 2015 IEEE International Conference on Big Data (Big Data).

[5]  Daniel S. Katz,et al.  Montage: a grid portal and software toolkit for science-grade astronomical image mosaicking , 2009, Int. J. Comput. Sci. Eng..

[6]  Andrew J. Hutton,et al.  Lustre: Building a File System for 1,000-node Clusters , 2003 .

[7]  José A. B. Fortes,et al.  CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications , 2008, 2008 IEEE Fourth International Conference on eScience.

[8]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[9]  Holger Gohlke,et al.  The Amber biomolecular simulation programs , 2005, J. Comput. Chem..

[10]  Anthony Skjellum,et al.  A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard , 1996, Parallel Comput..

[11]  Dhabaleswar K. Panda,et al.  High-Performance Design of Hadoop RPC with RDMA over InfiniBand , 2013, 2013 42nd International Conference on Parallel Processing.

[12]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[13]  Wu-chun Feng,et al.  Massively parallel genomic sequence search on the Blue Gene/P architecture , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[14]  Frank B. Schmuck,et al.  GPFS: A Shared-Disk File System for Large Computing Clusters , 2002, FAST.

[15]  Zhao Zhang,et al.  Rethinking Data-Intensive Science Using Scalable Analytics Systems , 2015, SIGMOD Conference.

[16]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[17]  Osamu Tatebe,et al.  Using the Gfarm File System as a POSIX Compatible Storage Platform for Hadoop MapReduce Applications , 2011, 2011 IEEE/ACM 12th International Conference on Grid Computing.