Chimera: a virtual data system for representing, querying, and automating data derivation

Much scientific data is not obtained from measurements but is instead derived from other data by applying computational procedures. We hypothesize that explicitly representing these procedures can enable documentation of data provenance, discovery of available methods, and on-demand data generation (so-called "virtual data"). To explore this idea, we have developed the Chimera virtual data system, which combines a virtual data catalog, for representing data derivation procedures and derived data, with a virtual data language interpreter that translates user requests into data definition and query operations on the database. We couple the Chimera system with distributed "data grid" services to enable on-demand execution of computation schedules constructed from database queries. We have applied this system to two challenge problems, with promising results: the reconstruction of simulated collision event data from a high-energy physics experiment, and the search of digital sky survey data for galactic clusters.
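
The idea of a catalog that records derivation procedures and answers both provenance and on-demand generation requests can be illustrated with a small Python sketch. This is not Chimera's schema or its virtual data language: the class names (`Derivation`, `VirtualDataCatalog`), the methods (`define`, `provenance`, `plan`), and the dataset and transformation names are all hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Derivation:
    """One recorded application of a procedure (hypothetical schema)."""
    transformation: str  # name of the computational procedure
    inputs: tuple        # logical names of the datasets consumed
    output: str          # logical name of the dataset produced


@dataclass
class VirtualDataCatalog:
    """Toy in-memory catalog; an assumption standing in for a database."""
    derivations: dict = field(default_factory=dict)  # output name -> Derivation
    materialized: set = field(default_factory=set)   # datasets that already exist

    def define(self, transformation, inputs, output):
        """Record how `output` can be derived from `inputs`."""
        self.derivations[output] = Derivation(transformation, tuple(inputs), output)

    def provenance(self, dataset):
        """Return every recorded derivation behind `dataset`, inputs first."""
        steps, seen = [], set()

        def visit(name):
            step = self.derivations.get(name)
            if step is None or name in seen:
                return  # primary (measured) data or already visited
            seen.add(name)
            for parent in step.inputs:
                visit(parent)
            steps.append(step)

        visit(dataset)
        return steps

    def plan(self, dataset):
        """Return the derivations that must still run to produce `dataset`."""
        steps, seen = [], set()

        def visit(name):
            if name in self.materialized or name in seen:
                return  # nothing to do: the data already exists
            seen.add(name)
            step = self.derivations.get(name)
            if step is None:
                raise KeyError(f"no recorded way to derive {name!r}")
            for parent in step.inputs:
                visit(parent)
            steps.append(step)

        visit(dataset)
        return steps


# Hypothetical two-step pipeline, loosely modeled on event reconstruction.
catalog = VirtualDataCatalog()
catalog.materialized.add("raw_events")
catalog.define("simulate", ["raw_events"], "sim_hits")
catalog.define("reconstruct", ["sim_hits"], "reco_events")

full_plan = [s.transformation for s in catalog.plan("reco_events")]

# Once an intermediate dataset exists, only the remaining step is scheduled.
catalog.materialized.add("sim_hits")
short_plan = [s.transformation for s in catalog.plan("reco_events")]
lineage = [s.transformation for s in catalog.provenance("reco_events")]
```

In this sketch `provenance` always walks the full derivation chain, while `plan` stops at data that is already materialized, which is the distinction between documenting how data was produced and deciding what must be computed on demand.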
