Valentine: Evaluating Matching Techniques for Dataset Discovery

Data scientists today search large data lakes to discover and integrate datasets. To bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use: schema matching now serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success depends heavily on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion, due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to the absence of open-source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods, and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight into the strengths and weaknesses of existing techniques that can serve as a guide for employing schema matching in future dataset discovery methods.
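To make the notion of schema matching as a ranking building block concrete, the following is a minimal sketch, not taken from Valentine itself. It assumes a simple instance-based matcher that scores column pairs by the Jaccard similarity of their values; the function names (`jaccard`, `rank_column_matches`) and the toy tables are hypothetical and chosen only for illustration.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair, best score first."""
    scores = [
        (s_name, t_name, jaccard(s_vals, t_vals))
        for (s_name, s_vals), (t_name, t_vals)
        in product(source.items(), target.items())
    ]
    return sorted(scores, key=lambda match: match[2], reverse=True)


# Hypothetical toy tables, represented as {column name: list of values}.
customers = {"name": ["Ann", "Bob", "Cy"], "city": ["Delft", "Paris", "Rome"]}
clients = {"full_name": ["Ann", "Bob", "Dee"], "location": ["Delft", "Paris", "Oslo"]}

ranked = rank_column_matches(customers, clients)
for source_col, target_col, score in ranked:
    print(f"{source_col} -> {target_col}: {score:.2f}")
```

A discovery method could consume such a ranked list to decide which tables are joinable or unionable. The matchers evaluated in the paper are considerably more sophisticated than this value-overlap baseline.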
