Data Integration Via Universal Keys

We describe DataSpace, an infrastructure for integrating distributed data [3]. In addition to supporting data and metadata, the infrastructure supports globally unique keys for integrating data, which we call universal keys. In contrast to some of the standard approaches to data integration, DataSpace does not try to achieve full semantic integration of distributed data; instead, it provides the minimum infrastructure necessary to integrate distributed data that is attached to universal keys. We also describe some applications in astronomy, bioinformatics, and earth science that have been built with this infrastructure. TCP-based web and grid services, as usually deployed, have been shown to have problems integrating very large data sets over wide-area, high-performance networks [1]. Recently, we have implemented a peer-to-peer version of DataSpace called Sector that is designed for working with large data sets over wide-area, high-performance networks [7]. Sector has been used to distribute the terabyte-size catalog data for the Sloan Digital Sky Survey (SDSS) from Chicago to locations in the U.S., Europe, and Asia. This paper is based in part on [2].
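
The idea of integrating data through universal keys, rather than through full semantic integration, can be illustrated with a minimal sketch. It is not the DataSpace or Sector API: the function name, the "uk:" key format, and the record layouts below are hypothetical and chosen only for illustration. The sketch assumes each site exposes its records as a mapping from universal key to attributes, and that integration amounts to joining the records that share a key.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def join_on_universal_key(*sources: Dict[str, Record]) -> Dict[str, Record]:
    """Merge attribute records from sources that share universal keys.

    Only keys present in every source are kept (an inner join).
    Hypothetical helper, not part of DataSpace.
    """
    if not sources:
        return {}
    # Keys common to all sources; iterating a dict yields its keys.
    shared = set(sources[0]).intersection(*sources[1:])
    merged: Dict[str, Record] = {}
    for key in sorted(shared):
        record: Record = {}
        for source in sources:
            # Later sources overwrite attributes with the same name.
            record.update(source[key])
        merged[key] = record
    return merged


# Hypothetical records held at two different sites.
site_a = {
    "uk:0001": {"ra": 10.5, "dec": -3.2},
    "uk:0002": {"ra": 11.0, "dec": 4.1},
}
site_b = {
    "uk:0001": {"magnitude": 17.3},
    "uk:0003": {"magnitude": 19.8},
}

integrated = join_on_universal_key(site_a, site_b)
```

Here the two sites need to agree only on the key "uk:0001", not on each other's attribute names or schemas.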