Assessing Data Virtualization for Irregularly Replicated Large Datasets

Large volumes of data are generated every day by experiments, simulations, and many other applications. Portions of these data are often irregularly replicated and distributed across different data sources. It is desirable to handle these pieces of irregular data (replicated or not) as a single large dataset. This approach, called data virtualization, is the focus of this paper. We present a system that handles irregularly replicated data and creates a virtual view of the union of the individual irregular portions of data hosted by each data source. The system indexes the data intervals from each data source and lets clients submit queries against the resulting virtual dataset. To select which server is responsible for each data interval of a query, we use and compare three algorithms: Random, Round-Robin, and Weighted Round-Robin. The comparison is driven by simulation, and all simulation parameters are taken from a real data-centered application (the Virtual Microscope).
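
To make the three selection policies concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that each data interval of a query comes with the list of servers holding a replica of it, and that the policy picks one of those servers. All names (`select_random`, `RoundRobin`, `WeightedRoundRobin`, `assign`) and the integer server weights are hypothetical. The weighted variant uses the "smooth" weighted round-robin scheme as a stand-in, since the abstract does not say how the weights are defined or applied.

```python
import random


def select_random(candidates, rng=random):
    """Random policy: pick any server that holds the interval."""
    return rng.choice(candidates)


class RoundRobin:
    """Round-Robin policy: rotate through the candidate list of each request."""

    def __init__(self):
        self._next = 0  # global position, advanced on every selection

    def select(self, candidates):
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server


class WeightedRoundRobin:
    """Smooth weighted round-robin restricted to the candidate servers.

    Assumption: `weights` maps each server to a positive number that
    reflects its relative capacity; the source does not define weights.
    """

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {server: 0 for server in self.weights}

    def select(self, candidates):
        total = 0
        # Every candidate earns credit in proportion to its weight.
        for server in candidates:
            self.current[server] += self.weights[server]
            total += self.weights[server]
        # The candidate with the most credit serves the interval and
        # pays back the total weight handed out in this round.
        best = max(candidates, key=lambda s: self.current[s])
        self.current[best] -= total
        return best


def assign(replicas, policy):
    """Map each interval of a query to one server chosen by `policy`.

    `replicas` maps an interval, e.g. (start, end), to the list of
    servers that store it.
    """
    return {interval: policy(servers) for interval, servers in replicas.items()}


if __name__ == "__main__":
    replicas = {(0, 99): ["A", "B"], (100, 199): ["A", "B", "C"], (200, 299): ["C"]}
    wrr = WeightedRoundRobin({"A": 2, "B": 1, "C": 1})
    print(assign(replicas, wrr.select))
```

Under these assumptions, a server with weight 2 receives about twice as many intervals as a server with weight 1 when both hold the same replicas, whereas Random and Round-Robin ignore server capacity.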
