A Metadata Catalog Service for Data Intensive Applications

Advances in computational, storage and network technologies as well as middle ware such as the Globus Toolkit allow scientists to expand the sophistication and scope of data-intensive applications. These applications produce and analyze terabytes and petabytes of data that are distributed in millions of files or objects. To manage these large data sets efficiently, metadata or descriptive information about the data needs to be managed. There are various types of metadata, and it is likely that a range of metadata services will exist in Grid environments that are specialized for particular types of metadata cataloguing and discovery. In this paper, we present the design of a Metadata Catalog Service (MCS) that provides a mechanism for storing and accessing descriptive metadata and allows users to query for data items based on desired attributes. We describe our experience in using the MCS with several applications and present a scalability study of the service.

[1]  Sonia Sharama,et al.  Grid Computing , 2004, Lecture Notes in Computer Science.

[2]  Peter Z. Kunszt,et al.  Giggle: A Framework for Constructing Scalable Replica Location Services , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[3]  Joshua R. Smith,et al.  LIGO: the Laser Interferometer Gravitational-Wave Observatory , 1992, Science.

[4]  Carl Kesselman,et al.  A Metadata Catalog Service for Data Intensive Applications , 2003, SC.

[5]  Erwin Laure,et al.  Replica Management in Data Grids , 2002 .

[6]  Yolanda Gil,et al.  Pegasus: Planning for Execution in Grids , 2002 .

[7]  Ian T. Foster,et al.  Grid information services for distributed resource sharing , 2001, Proceedings 10th IEEE International Symposium on High Performance Distributed Computing.

[8]  Ian T. Foster,et al.  A community authorization service for group collaboration , 2002, Proceedings Third International Workshop on Policies for Distributed Systems and Networks.

[9]  Jason Lee,et al.  High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[10]  Ian Foster,et al.  The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition , 1998, The Grid 2, 2nd Edition.

[11]  Ian T. Foster,et al.  The anatomy of the grid: enabling scalable virtual organizations , 2001, Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid.

[12]  Ian T. Foster,et al.  A security architecture for computational grids , 1998, CCS '98.

[13]  Ami Marowka,et al.  The GRID: Blueprint for a New Computing Infrastructure , 2000, Parallel Distributed Comput. Pract..

[14]  Carl Kesselman,et al.  High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[15]  Carl Kesselman,et al.  GriPhyN and LIGO, building a virtual data Grid for gravitational wave scientists , 2002, Proceedings 11th IEEE International Symposium on High Performance Distributed Computing.

[16]  Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 15-21 November 2003, Phoenix, AZ, USA, CD-Rom , 2003 .

[17]  Adam Arbree,et al.  Mapping Abstract Complex Workflows onto Grid Environments , 2003, Journal of Grid Computing.