MARIANE: MApReduce Implementation Adapted for HPC Environments

MapReduce is an increasingly popular framework and a potent programming model. The most popular open source implementation of MapReduce, Hadoop, is built on the Hadoop Distributed File System (HDFS). However, because HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on the majority of existing HPC environments, such as TeraGrid and NERSC. These environments typically provide globally shared file systems such as NFS and GPFS. On such resource-rich HPC infrastructures, using Hadoop not only creates compatibility issues but also degrades overall performance because of the added overhead of HDFS. This paper presents MARIANE (MApReduce Implementation Adapted for HPC Environments), a MapReduce implementation directly suited to HPC environments, and lays out the design choices that yield better performance in those settings. By leveraging the functions that distributed file systems already provide and abstracting them away from its MapReduce framework, MARIANE makes the model usable in a growing number of HPC environments and improves its performance there. Through MARIANE, the paper shows the applicability and high performance of the MapReduce paradigm with an implementation that is designed for clustered and shared-disk file systems and, as such, is not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model and quantifies the performance gains of our approach over Apache Hadoop in a data-intensive distributed setting, measured on the Magellan test bed at the National Energy Research Scientific Computing Center (NERSC).
