Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks

Deep-learning-based techniques have recently shown promising results on data integration problems. Some methods directly use embeddings pre-trained on a large corpus such as Wikipedia. However, such embeddings may not be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing much of the contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively "describe" the similarity across elements (tokens, attributes, rows) in the two datasets; the embeddings are then learned from these sentences. Third, we propose effective optimizations to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution in both supervised and unsupervised settings.
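To make the first two contributions concrete, the following is a minimal sketch of how a graph over tokens, attributes, and rows might be built from two small tables and then turned into sentences. It is an illustration under stated assumptions, not the EmbDI implementation: the abstract does not say how sentences are derived, so uniform random walks are assumed here, and the function names (`build_graph`, `random_walk_sentences`), the node prefixes (`tok__`, `row__`, `attr__`), and the toy tables are all hypothetical.

```python
import random
from collections import defaultdict


def build_graph(tables):
    """Build an undirected graph linking each token to its row and attribute.

    `tables` maps a table name to a list of row dicts. Node naming is an
    assumption made for this sketch: tokens are shared across tables, while
    row and attribute nodes are qualified by the table name.
    """
    graph = defaultdict(set)
    for name, rows in tables.items():
        for i, row in enumerate(rows):
            row_node = f"row__{name}_{i}"
            for attr, value in row.items():
                attr_node = f"attr__{name}_{attr}"
                # Naive whitespace tokenization of the cell value.
                for token in str(value).lower().split():
                    tok_node = f"tok__{token}"
                    for other in (row_node, attr_node):
                        graph[tok_node].add(other)
                        graph[other].add(tok_node)
    return graph


def random_walk_sentences(graph, walks_per_node=5, walk_length=10, seed=0):
    """Generate sentences as uniform random walks (an assumed strategy)."""
    rng = random.Random(seed)
    # Sorted neighbor lists keep the walks reproducible for a given seed.
    neighbors = {node: sorted(adj) for node, adj in graph.items()}
    sentences = []
    for start in sorted(neighbors):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(neighbors[walk[-1]]))
            sentences.append(walk)
    return sentences


# Two toy tables with different schemas that describe the same entity.
tables = {
    "A": [{"title": "The Matrix", "year": "1999"}],
    "B": [{"name": "Matrix", "released": "1999"}],
}
graph = build_graph(tables)
sentences = random_walk_sentences(graph)
```

Because the token node `tok__matrix` is shared, walks can pass from `row__A_0` to `row__B_0` and from `attr__A_title` to `attr__B_name`, so rows and attributes from the two datasets appear in the same sentences. Such sentences could then be passed to any word-embedding trainer that accepts lists of tokens; that step is left out here to keep the sketch free of external dependencies.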
