PortHadoop: Support direct HPC data processing in Hadoop

The success of the Hadoop MapReduce programming model has greatly propelled research in big data analytics. In recent years, there has been growing interest in the High Performance Computing (HPC) community in using Hadoop-based tools to process scientific data. This interest stems from three facts: data movement has become prohibitively expensive, high-performance data analytics has become an important part of HPC, and Hadoop-based tools can perform large-scale data processing in a time- and budget-efficient manner. In this study, we propose PortHadoop, an enhanced Hadoop architecture that enables MapReduce applications to read data directly from HPC parallel file systems (PFS). PortHadoop saves HDFS storage space and, more importantly, avoids the otherwise costly data copying. PortHadoop preserves all the semantics of the original Hadoop system and of the PFS. Therefore, Hadoop MapReduce applications can run on PortHadoop without code changes, except that the input file location is in the PFS rather than in HDFS. Our experimental results show that PortHadoop can operate effectively and efficiently with the PVFS2 and Ceph file systems.
