Paper information - Improving the Hadoop map/reduce framework to support concurrent appends through the BlobSeer BLOB management system


Hadoop is a reference software framework supporting the Map/Reduce programming model. It relies on the Hadoop Distributed File System (HDFS) as its primary storage system. Although HDFS does not support concurrently appending data to existing files, we argue that Map/Reduce applications, as well as other classes of applications, can benefit from such functionality. We provide support for concurrent appends by building a concurrency-optimized data storage layer based on the BlobSeer data management service. Moreover, we modify the Hadoop Map/Reduce framework to use the append operation in the "reduce" phase of the application. To validate this work, we perform experiments on a large number of nodes of the Grid'5000 testbed. We demonstrate that massively concurrent append and read operations have a low impact on each other. In addition, measurements with an application shipped with Hadoop show that support for concurrent appends to a shared file is introduced at no extra cost, while the number of files managed by the Map/Reduce framework is substantially reduced.
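To make the idea of reducers appending to one shared file more concrete, here is a minimal toy sketch in Python. It is an in-memory illustration only: `SharedAppendFile` and `run_reducers` are hypothetical names introduced here, a lock stands in for whatever ordering the storage layer provides, and nothing in it reflects the actual BlobSeer or Hadoop APIs or their internal mechanisms.

```python
import threading


class SharedAppendFile:
    """Toy in-memory stand-in for a file that accepts concurrent appends.

    Hypothetical class for illustration; not a BlobSeer or HDFS interface.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._records = []

    def append(self, record):
        # Each append is applied atomically and returns the position
        # at which the record was stored.
        with self._lock:
            offset = len(self._records)
            self._records.append(record)
            return offset

    def read_all(self):
        # Return a snapshot copy so readers never see a list being mutated.
        with self._lock:
            return list(self._records)


def run_reducers(partitions):
    """Run one 'reducer' thread per partition.

    Every reducer sums the values of each key in its partition and appends
    the result to a single shared output, instead of writing its own file.
    """
    out = SharedAppendFile()

    def reducer(partition):
        for key, values in partition.items():
            out.append((key, sum(values)))

    threads = [threading.Thread(target=reducer, args=(p,)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out.read_all()


if __name__ == "__main__":
    parts = [{"a": [1, 2]}, {"b": [3]}, {"c": [4, 5, 6]}]
    # Record order depends on thread scheduling, so sort before printing.
    print(sorted(run_reducers(parts)))
```

In this sketch three reducers produce one output object rather than three, which mirrors the reduction in file count mentioned in the abstract. The order of appended records is not deterministic, so a consumer should not rely on it.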
