Concentric layout, a new scientific data layout for matrix data-set in Hadoop file system

Due to the explosive growth in the size of scientific data-sets, data-intensive computing and analysis are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimise the data layout or format beforehand to facilitate future data access. At the same time, current research shows the increasing popularity of the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data-sets are not directly supported by the current MapReduce framework. This gap motivates us to provide support for these scientific data access patterns in the MapReduce framework. In our work, we study the data access patterns in matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout maintains the dimensional property of the data at the chunk level. In contrast to the continuous data layout adopted in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk. This layout can guarantee that the average performance of data access is optimal regardless of the access pattern. The concentric data layout requires reorganising the data before it is analysed or processed. We evaluate it on a real-world halo-finding application; the results indicate that the concentric data layout improves the overall performance by up to 38%.
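The difference between a continuous layout and a layout that keeps each sub-matrix in one chunk can be illustrated with a small sketch. The code below is not the concentric layout itself, whose exact chunk arrangement is not described in this abstract; it assumes plain square tiling as a stand-in for "one sub-matrix per chunk", and all function names and the 8x8 matrix size are hypothetical. It counts how many chunks a row access, a column access, and a sub-matrix access touch under each layout.

```python
def row_major_chunks(n_rows, n_cols, chunk_elems):
    """Continuous layout: cut the row-major element stream into fixed-size chunks."""
    return {
        (r, c): (r * n_cols + c) // chunk_elems
        for r in range(n_rows)
        for c in range(n_cols)
    }


def tiled_chunks(n_rows, n_cols, tile):
    """Sub-matrix layout (assumed square tiling): each tile x tile block is one chunk."""
    tiles_per_row = (n_cols + tile - 1) // tile
    return {
        (r, c): (r // tile) * tiles_per_row + (c // tile)
        for r in range(n_rows)
        for c in range(n_cols)
    }


def chunks_touched(chunk_of, rows, cols):
    """Number of distinct chunks needed to read the cells rows x cols."""
    return len({chunk_of[(r, c)] for r in rows for c in cols})


# Hypothetical 8x8 matrix; both layouts use chunks of 16 elements.
N = 8
continuous = row_major_chunks(N, N, chunk_elems=16)
tiled = tiled_chunks(N, N, tile=4)

# Three illustrative access patterns: one row, one column, one 4x4 sub-matrix.
patterns = {
    "row": ([0], range(N)),
    "column": (range(N), [0]),
    "sub-matrix": (range(4), range(4)),
}

for name, (rows, cols) in patterns.items():
    print(name, chunks_touched(continuous, rows, cols), chunks_touched(tiled, rows, cols))
```

In this toy setting the continuous layout is best for row access but needs four chunks for a column, while the tiled layout never needs more than two chunks for any of the three patterns. This only illustrates the trade-off the abstract refers to; it does not reproduce the reported measurements.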
