Processing queries and merging schemas in support of data integration

The goal of data integration is to provide a uniform interface, called a mediated schema, to a set of autonomous data sources. This interface lets users query a set of databases without knowing the schemas of the underlying data sources. This thesis addresses two aspects of data integration: an algorithm for answering queries posed to a mediated schema, and the process of creating a mediated schema. First, we present the MiniCon algorithm for answering queries in a data integration system and explain why MiniCon outperforms previous algorithms by up to several orders of magnitude. Second, given two relational schemas for data sources, we propose an approach for using conjunctive queries to describe mappings between them. We analyze the formal semantics of these mappings, show how to derive a mediated schema based on them, and show how to translate user queries over the mediated schema into queries over local schemas. We then present a generic Merge operator that merges schemas and mappings regardless of data model or application. Finally, we show how to implement the derivation of mediated schemas using the generic Merge operator.
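To make the idea of translating a query over a mediated schema into a query over local schemas concrete, the following is a minimal sketch under stated assumptions. It is not MiniCon and not the mapping semantics developed in the thesis: it uses plain view unfolding, in which a mediated relation is defined by a single conjunctive query over source relations. The relation names (`Movie`, `s1_film`, `s2_cast`), the sample data, and the helper functions `unfold` and `evaluate` are all hypothetical and introduced only for illustration.

```python
# Terms that are strings starting with "?" are variables; anything else is a constant.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def unfold(body, mappings):
    """Replace each mediated-schema atom by the body of its defining conjunctive query.

    Assumes every mapping head is a tuple of distinct variables.
    """
    new_body = []
    for i, (rel, terms) in enumerate(body):
        if rel not in mappings:
            new_body.append((rel, terms))
            continue
        m_head, m_body = mappings[rel]
        # Head variables of the mapping are replaced by the query atom's terms;
        # remaining (existential) variables get a fresh name per query atom.
        subst = dict(zip(m_head, terms))
        for m_rel, m_terms in m_body:
            new_body.append((m_rel, tuple(
                subst.get(t, f"{t}_{i}") if is_var(t) else t
                for t in m_terms)))
    return new_body


def evaluate(head, body, db):
    """Evaluate a conjunctive query over in-memory relations (sets of tuples)."""
    bindings = [{}]
    for rel, terms in body:
        next_bindings = []
        for b in bindings:
            for row in db.get(rel, ()):
                if len(row) != len(terms):
                    continue
                new_b = dict(b)
                ok = True
                for t, v in zip(terms, row):
                    if is_var(t):
                        # Bind the variable, or check it against its existing binding.
                        if new_b.setdefault(t, v) != v:
                            ok = False
                            break
                    elif t != v:
                        ok = False
                        break
                if ok:
                    next_bindings.append(new_b)
        bindings = next_bindings
    return {tuple(b[t] if is_var(t) else t for t in head) for b in bindings}


# Hypothetical local sources.
db = {
    "s1_film": {("Matrix", 1999), ("Heat", 1995)},
    "s2_cast": {("Matrix", "Reeves"), ("Heat", "Pacino"), ("Heat", "De Niro")},
}

# Hypothetical mapping: Movie(t, y, a) :- s1_film(t, y), s2_cast(t, a)
mappings = {
    "Movie": (("?t", "?y", "?a"),
              [("s1_film", ("?t", "?y")), ("s2_cast", ("?t", "?a"))]),
}

# User query over the mediated schema: q(a) :- Movie(t, 1999, a)
query_head = ("?a",)
query_body = [("Movie", ("?t", 1999, "?a"))]

rewritten = unfold(query_body, mappings)
answers = evaluate(query_head, rewritten, db)
print(rewritten)
print(answers)
```

In this sketch the user never names `s1_film` or `s2_cast`; the rewriting step introduces them. Unfolding only covers the case where mediated relations are defined in terms of the sources. When sources are instead described as views over the mediated schema, the problem is answering queries using views, which is the setting MiniCon addresses.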
