Paper Information - Efficient Semantically Equal Join on Strings in Practice

Efficient Semantically Equal Join on Strings in Practice

In general, data integration begins with schema integration, which must be followed by a detailed comparison of data instances in order to unify the data representation. In this paper, we address the limits of automating data-level reconciliation in cases where the compared data is semantically equivalent but the values of given attributes are represented differently. We assume that a semantic relationship between potentially used terms is established by a human expert prior to the designed computations, and is represented in an auxiliary table built and maintained for each attribute. In this context, we introduce the notion of a semantically equal join (SEJ), a join operation based on a pre-defined semantic relationship. Our goal is to propose a solution for SEJ that can be supported by standard SQL. The paper begins with an illustration of the approach for a single-attribute join, followed by a generalisation of SEJ to any number of join attributes. It continues with performance considerations for the proposed method, which are supported by extensive experimentation with our implementation of SEJ on synthetic datasets.

Maria E. Orlowska | Juggapong Natwichai | Xingzhi Sun

[1] Julian R. Ullmann, et al. A Binary n-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion and Reversal Errors in Words, 1977, Comput. J.

[2] Luis Gravano, et al. Text joins for data cleansing and integration in an RDBMS, 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[3] Alon Y. Halevy, et al. Enterprise information integration: successes, challenges and controversies, 2005, SIGMOD '05.

[4] Divesh Srivastava, et al. Flexible String Matching Against Large Databases in Practice, 2004, VLDB.

[5] Maria E. Orlowska, et al. Interoperability in information systems, 1995, Inf. Syst. J.

[6] Luis Gravano, et al. Text joins in an RDBMS for web data integration, 2003, WWW '03.

[7] M. W. Orlowski. On Optimisation of Joins in Distributed Database System, 1992, Future Databases.

[8] Erhard Rahm, et al. A survey of approaches to automatic schema matching, 2001, The VLDB Journal.

[9] Luis Gravano, et al. Approximate String Joins in a Database (Almost) for Free, 2001, VLDB.