Applying database support for large scale data driven science in distributed environments

In a rapidly growing set of applications, referred to as data-driven applications, the analysis of large amounts of data drives the next steps taken by the scientist, e.g., running new simulations, making additional measurements, or extending the analysis to larger data collections. Two critical steps in data analysis are extracting the data of interest from large and potentially distributed datasets and moving it from storage clusters to compute clusters for processing. We have developed a middleware framework, called GridDB-Lite, that is designed to support these two steps efficiently. We describe the application of GridDB-Lite in large-scale oil reservoir simulation studies and experimentally evaluate several optimizations that can be employed in the GridDB-Lite runtime system.
