Concentric Layout, a New Scientific Data Distribution Scheme in Hadoop File System

Data generated by scientific simulations, sensors, monitors, and optical telescopes is growing at a dramatic rate. To analyze this raw data quickly and space-efficiently, a preprocessing step is needed to improve performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns commonly applied to scientific data sets are not directly supported by the current MapReduce framework. This gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to support these access patterns in MapReduce. In this work, we study the data access patterns in matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout is a hierarchical layout that preserves the dimensional properties of large data sets. In contrast to the continuous data layout adopted by the current Hadoop framework, it stores the data from the same sub-matrix in one chunk and then arranges the chunks symmetrically at a higher level, which matches matrix-like computation well. The layout is applied in a preprocessing step and optimizes subsequent runs of MapReduce applications. Experiments show that the concentric data layout improves overall performance, reducing execution time by about 38% when reading a 64 GB file. It also mitigates the overhead of reading unused data and increases useful data efficiency by 32% on average.
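The effect of storing a sub-matrix in one chunk can be illustrated with a small sketch. It is not the authors' implementation: it models only the first level of the idea (one sub-matrix per chunk) and does not model the symmetric, higher-level placement of chunks. The matrix size, chunk size, function names, and the assumption that the continuous layout stores the matrix row by row are all hypothetical choices made for illustration.

```python
def row_major_chunks(n_cols, chunk_elems, r0, r1, c0, c1):
    """Chunk ids touched when the matrix is stored row by row (assumed
    continuous layout) and cut into fixed-size chunks of chunk_elems elements.
    The requested region is rows [r0, r1) and columns [c0, c1)."""
    touched = set()
    for r in range(r0, r1):
        first = (r * n_cols + c0) // chunk_elems
        last = (r * n_cols + c1 - 1) // chunk_elems
        touched.update(range(first, last + 1))
    return touched


def tiled_chunks(n_cols, tile, r0, r1, c0, c1):
    """Chunk ids touched when each tile x tile sub-matrix is stored in its
    own chunk (assumes n_cols is a multiple of tile)."""
    tiles_per_row = n_cols // tile
    touched = set()
    for tr in range(r0 // tile, (r1 - 1) // tile + 1):
        for tc in range(c0 // tile, (c1 - 1) // tile + 1):
            touched.add(tr * tiles_per_row + tc)
    return touched


def useful_fraction(requested_elems, n_chunks, chunk_elems):
    """Share of the elements read from storage that the request actually needs."""
    return requested_elems / (n_chunks * chunk_elems)


# Hypothetical setup: a 1024 x 1024 matrix, chunks of 64 * 64 elements,
# and a request for one aligned 64 x 64 sub-matrix.
N_COLS, TILE = 1024, 64
CHUNK_ELEMS = TILE * TILE
region = (128, 192, 256, 320)  # r0, r1, c0, c1
requested = (region[1] - region[0]) * (region[3] - region[2])

row_major = row_major_chunks(N_COLS, CHUNK_ELEMS, *region)
tiled = tiled_chunks(N_COLS, TILE, *region)

row_major_eff = useful_fraction(requested, len(row_major), CHUNK_ELEMS)
tiled_eff = useful_fraction(requested, len(tiled), CHUNK_ELEMS)

print(len(row_major), row_major_eff)
print(len(tiled), tiled_eff)
```

In this toy setup the row-by-row layout touches 16 chunks to serve the request, so only 1/16 of the data read is useful, while the sub-matrix layout touches a single chunk. These numbers come from the assumed sizes above and are unrelated to the measurements reported in the abstract.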
