Efficient Semantically Equal Join on Strings in Practice

In general, data integration begins with schema integration, which must be followed by a detailed comparison of data instances in order to unify the data representation. In this paper, we address the limits of automating data-level reconciliation in cases where the compared data are semantically equivalent but the values of given attributes are represented differently. We assume that a semantic relationship between the terms that may occur is established by a human expert before any computation, and is represented in an auxiliary table built and maintained for each attribute. In this context, we introduce the notion of a semantically equal join (SEJ), a join operation based on a pre-defined semantic relationship. Our goal is to propose a solution for SEJ that can be supported by standard SQL. The paper begins with an illustration of the approach for a single-attribute join, followed by a generalisation of SEJ to any number of join attributes. It then turns to performance considerations for the proposed method, which are supported by extensive experiments with our implementation of SEJ on synthetic datasets.
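To make the idea of a single-attribute SEJ concrete, the following is a minimal sketch in Python using the standard sqlite3 module. It is an illustration under stated assumptions, not the paper's implementation: the table names r, s and country_terms, the column names, the sample rows, and the layout of the auxiliary table (each term mapped to one expert-assigned concept label) are all hypothetical.

```python
import sqlite3


def semantically_equal_join(conn):
    """Join r and s on country values that share a concept label.

    Assumption: the auxiliary table country_terms(term, concept) maps every
    admissible spelling of a value to one expert-assigned concept label.
    """
    query = """
        SELECT r.id, r.country, s.id, s.country
        FROM r
        JOIN country_terms AS tr ON tr.term = r.country      -- concept of the r value
        JOIN country_terms AS ts ON ts.concept = tr.concept  -- all terms of that concept
        JOIN s ON s.country = ts.term                        -- s rows using any such term
        ORDER BY r.id, s.id
    """
    return conn.execute(query).fetchall()


# Hypothetical sample data, kept in memory only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE s (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE country_terms (term TEXT PRIMARY KEY, concept TEXT);

    INSERT INTO r VALUES (1, 'USA'), (2, 'Germany'), (3, 'France');
    INSERT INTO s VALUES (10, 'United States'), (20, 'Deutschland'), (30, 'Spain');
    INSERT INTO country_terms VALUES
        ('USA', 'US'), ('United States', 'US'),
        ('Germany', 'DE'), ('Deutschland', 'DE'),
        ('France', 'FR'), ('Spain', 'ES');
""")

rows = semantically_equal_join(conn)
for row in rows:
    print(row)
```

In this sketch the rows for 'USA' and 'Germany' are paired with their differently spelled counterparts in s, while 'France' and 'Spain' find no partner. Joining through the auxiliary table twice keeps the query within plain SQL, which is the constraint the abstract sets for SEJ.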