We describe an infrastructure for integrating distributed data called DataSpace [3]. In addition to supporting data and metadata, the infrastructure also supports globally unique keys for integrating data, which we call universal keys. In contrast to some of the standard approaches to data integration, DataSpace does not try to achieve full semantic integration of distributed data, but instead provides the minimum infrastructure necessary to integrate distributed data that is attached to universal keys. We also describe some applications that have been built with this infrastructure in astronomy, bioinformatics, and earth science. TCP-based web and grid services, as usually deployed, have been shown to have problems integrating very large data sets over wide-area, high-performance networks [1]. Recently, we have implemented a peer-to-peer version of DataSpace called Sector that is designed for working with large data sets over wide-area, high-performance networks [7]. Sector has been used to distribute the terabyte-size catalog data for the Sloan Digital Sky Survey (SDSS) from Chicago to locations in the U.S., Europe, and Asia. This paper is based in part on [2].
[1] Robert L. Grossman, et al. Distributing the Sloan Digital Sky Survey Using UDT and Sector. 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006.
[2] Robert L. Grossman, et al. Assigning Unique Keys to Chemical Compounds for Data Integration: Some Interesting Counter Examples. DILS, 2005.
[3] David Maier, et al. Principles of dataspace systems. PODS '06, 2006.
[4] Robert L. Grossman, et al. Data integration in a bandwidth-rich world. CACM, 2003.
[5] Robert L. Grossman, et al. An Empirical Study of the Universal Chemical Key Algorithm for Assigning Unique Keys to Chemical Compounds. J. Bioinform. Comput. Biol., 2004.
[6] Robert L. Grossman, et al. Data webs for earth science data. Parallel Comput., 2003.