Extensible and similarity-based grouping for data integration

Grouping and aggregation are a fitting paradigm for many problems in data integration, but the common equality-based form of grouping leaves a number of them unsolved. We propose a generic approach to user-defined grouping as part of an SQL extension, which allows more complex grouping functions, for instance the integration of data mining algorithms. Furthermore, we discuss high-level language primitives for common applications.
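To illustrate the contrast with equality-based grouping, the following is a minimal Python sketch of similarity-based grouping followed by aggregation. It is not the SQL extension proposed here, whose syntax is not given in this text. The predicate `similar`, the helper `group_by_similarity`, the threshold of 0.8, and the sample rows are all hypothetical, and the choice to group by the transitive closure of the pairwise predicate is an assumption made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical user-defined predicate: case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_by_similarity(rows, key, predicate=similar):
    """Group rows whose keys are linked by the predicate.

    Assumption: groups are the connected components of the pairwise
    similarity relation (its transitive closure), tracked with union-find.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive pairwise comparison; fine for a sketch, quadratic in the input.
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if predicate(key(rows[i]), key(rows[j])):
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[find(i)].append(row)
    return list(groups.values())


# Made-up sample data: (company name, amount) from two hypothetical sources.
rows = [
    ("International Business Machines", 120),
    ("Internatinal Business Machines", 80),
    ("Acme Corp", 50),
    ("ACME Corp.", 70),
    ("Globex", 10),
]

groups = group_by_similarity(rows, key=lambda r: r[0])

# Aggregate per group: keep the first name seen, sum the amounts.
totals = [(g[0][0], sum(amount for _, amount in g)) for g in groups]
```

An equality-based GROUP BY on the name column would keep these five rows in five separate groups. Under the similarity predicate, the misspelled and differently punctuated variants fall into the same group, so the aggregation runs over three groups.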
