A distributed data management middleware for data-driven application systems

A key challenge in supporting data-driven scientific applications is storing and managing input and output data in a distributed environment. We describe a distributed storage middleware, built on a data and metadata management framework, that addresses this problem. In this middleware, applications define the structure of their input and output data using XML schemas. The system supports (1) registration, versioning, and management of schemas, and (2) storage, querying, and retrieval, in distributed databases, of instance data that conform to those schemas. We evaluate the system experimentally on a set of PC clusters connected over wide-area networks (WANs) and local-area networks (LANs).
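The two services named above (schema registration with versioning, and storage and querying of instance data) can be illustrated with a minimal single-process sketch. Everything in it is an assumption made for illustration, not the middleware's actual interface: the `SchemaRegistry` class, its method names, the in-memory dictionaries standing in for distributed databases, and the use of the XPath subset in Python's `xml.etree.ElementTree` as the query language. The sketch only checks that documents are well-formed XML; it does not validate instances against their schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


class SchemaRegistry:
    """Hypothetical in-memory stand-in for the schema and instance services."""

    def __init__(self):
        self._versions = defaultdict(list)   # schema name -> schema texts, oldest first
        self._instances = defaultdict(list)  # (name, version) -> parsed instance roots

    def register(self, name, schema_text):
        """Store a new schema version and return its 1-based version number."""
        ET.fromstring(schema_text)  # raises ParseError if not well-formed XML
        self._versions[name].append(schema_text)
        return len(self._versions[name])

    def store(self, name, version, instance_text):
        """Attach an instance document to an already registered schema version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"unknown schema version: {name} v{version}")
        self._instances[(name, version)].append(ET.fromstring(instance_text))

    def query(self, name, version, path):
        """Return elements matching `path` across all instances of one version."""
        matches = []
        for root in self._instances.get((name, version), []):
            matches.extend(root.findall(path))
        return matches


# Usage: two versions of a made-up "reservoir" schema, one instance under v2.
registry = SchemaRegistry()
empty_schema = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>"
v1 = registry.register("reservoir", empty_schema)
v2 = registry.register("reservoir", empty_schema)
registry.store(
    "reservoir", v2,
    "<grid><cell layer='1'>0.3</cell><cell layer='2'>0.7</cell></grid>",
)
hits = registry.query("reservoir", v2, "./cell[@layer='2']")
```

Keying instances by schema name and version keeps data written under an older schema queryable after the schema evolves, which is one plausible reason to version schemas at all. A real deployment would replace the dictionaries with database back ends spread across nodes.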
