Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS

The Hadoop framework has been widely used on clusters to build large-scale, high-performance systems. However, the Hadoop Distributed File System (HDFS) is designed to manage large files and suffers a performance penalty when managing a large number of small files. As a consequence, many web applications, such as WebGIS, may not benefit from Hadoop. In this paper, we propose an approach to optimize the I/O performance of small files on HDFS. The basic idea is to combine small files into large ones to reduce the number of files, and to build an index for each file. Furthermore, some novel features, such as grouping neighboring files and retaining several of the latest versions of data, are considered to match the characteristics of WebGIS access patterns. Preliminary experimental results show that our approach improves small-file I/O performance.
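The basic idea of combining small files and indexing them can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the in-memory index layout (name to offset and length), and the tile file names are assumptions made for illustration, and an ordinary byte stream stands in for a large HDFS file.

```python
import io
from typing import BinaryIO, Dict, Iterable, Tuple

# Hypothetical index layout: file name -> (byte offset, length) in the container.
Index = Dict[str, Tuple[int, int]]


def merge_small_files(files: Iterable[Tuple[str, bytes]], out: BinaryIO) -> Index:
    """Append each small file to one large container and record where it landed."""
    index: Index = {}
    offset = 0
    for name, data in files:
        out.write(data)
        index[name] = (offset, len(data))
        offset += len(data)
    return index


def read_small_file(container: BinaryIO, index: Index, name: str) -> bytes:
    """Fetch one small file with a single seek and read, using the index."""
    offset, length = index[name]
    container.seek(offset)
    return container.read(length)


if __name__ == "__main__":
    # Illustrative tile names; writing neighboring tiles in order keeps them
    # adjacent in the container, loosely mirroring the grouping idea.
    tiles = [("tile_0_0.png", b"AAAA"), ("tile_0_1.png", b"BB"), ("tile_1_0.png", b"CCCCCC")]
    container = io.BytesIO()
    idx = merge_small_files(tiles, container)
    print(read_small_file(container, idx, "tile_0_1.png"))
```

In this sketch the storage layer sees one large file instead of three small ones, and each lookup costs one index access plus one ranged read.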
