DSToolkit: An Architecture for Flexible Dataspace Management

The vision of dataspaces is to provide many of the benefits of classical data integration, but with reduced up-front costs. Combining this with opportunities for incremental refinement enables a ‘pay-as-you-go’ approach to data integration, resulting in simplified integrated access to distributed data. It has been speculated that model management could provide the basis for dataspace management; however, this has not been investigated until now. Here, we present DSToolkit, the first dataspace management system that is based on model management. DSToolkit therefore benefits from the flexibility that model management provides for managing schemas represented in heterogeneous models; supports the complete dataspace lifecycle, which includes automatic initialisation, maintenance and improvement of a dataspace; and allows the user to provide feedback by annotating the result tuples returned by queries the user has posed. The feedback gathered is used to improve the dataspace by annotating, selecting and refining mappings. When a new data source is added, these techniques can also be applied, without the need for additional feedback on that source, to determine its perceived quality with respect to the feedback already gathered and to identify the best mappings over all sources, including the new one.
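As a rough illustration of how feedback on result tuples might be used to annotate and select mappings, the sketch below estimates a precision and a recall per mapping and keeps the mappings above a precision threshold. It is a minimal sketch under stated assumptions, not DSToolkit's implementation: the names `Feedback`, `annotate_mappings` and `select_mappings`, the set-of-tuple-ids representation, and the threshold-based selection are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Feedback:
    """Hypothetical feedback item: the user marks a tuple as expected or not."""
    tuple_id: str
    expected: bool  # True: tuple belongs in the result; False: it does not


Annotation = Tuple[Optional[float], Optional[float]]  # (precision, recall)


def annotate_mappings(
    mapping_results: Dict[str, Set[str]], feedback: Iterable[Feedback]
) -> Dict[str, Annotation]:
    """Estimate precision and recall of each mapping from the feedback given.

    Only tuples that carry feedback are counted; an estimate is None when
    no feedback is available to compute it.
    """
    items = list(feedback)
    expected = {f.tuple_id for f in items if f.expected}
    unexpected = {f.tuple_id for f in items if not f.expected}

    annotations: Dict[str, Annotation] = {}
    for name, returned in mapping_results.items():
        tp = len(returned & expected)    # returned and marked as expected
        fp = len(returned & unexpected)  # returned but marked as not expected
        fn = len(expected - returned)    # marked as expected but not returned
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        annotations[name] = (precision, recall)
    return annotations


def select_mappings(
    annotations: Dict[str, Annotation], min_precision: float
) -> List[str]:
    """Keep mappings whose estimated precision meets the threshold (assumed policy)."""
    return sorted(
        name
        for name, (precision, _) in annotations.items()
        if precision is not None and precision >= min_precision
    )


# Illustrative data: two candidate mappings and feedback on three result tuples.
results = {"m1": {"t1", "t2", "t3"}, "m2": {"t1", "t4"}}
given = [Feedback("t1", True), Feedback("t2", False), Feedback("t4", True)]
annotated = annotate_mappings(results, given)
chosen = select_mappings(annotated, min_precision=0.8)
```

Because the annotations depend only on tuple identifiers and the feedback already collected, the same functions could in principle be re-run when the mappings of a new source are added to `results`, without asking the user for further feedback.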
