LakeBench: Benchmarks for Data Discovery over Data Lakes

Enterprises increasingly need to navigate data lakes intelligently, and data discovery is central to that need. Of particular importance is the ability to find related tables in data repositories: tables that are unionable, joinable, or subsets of each other. Public benchmarks for these tasks are scarce, and related work largely targets private datasets. In LakeBench, we develop multiple benchmarks for these tasks using tables drawn from a diverse set of data sources, such as government data from CKAN, Socrata, and the European Central Bank. We compare the performance of four publicly available tabular foundational models on these tasks. None of these models was trained on the data discovery tasks developed for this benchmark; unsurprisingly, their performance leaves significant room for improvement. The results suggest that establishing such benchmarks could help the community build tabular models usable for data discovery in data lakes.
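To make the three table relationships concrete, the following is a minimal sketch in Python. It is not the LakeBench benchmark construction or any of the evaluated models: the header-overlap rule for unionability, the value-containment score for joinability, the exact row-set check for subsets, and all function names, thresholds, and example tables are assumptions chosen purely for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[str]]


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def containment(a: Iterable[str], b: Iterable[str]) -> float:
    """Fraction of the distinct values in a that also appear in b."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def is_unionable(t1: Table, t2: Table, threshold: float = 0.5) -> bool:
    """Toy rule (assumption): tables are unionable if their headers overlap enough."""
    return jaccard(t1.keys(), t2.keys()) >= threshold


def best_join_containment(t1: Table, t2: Table) -> float:
    """Toy joinability score (assumption): best value containment over all column pairs."""
    return max(
        (containment(c1, c2) for c1 in t1.values() for c2 in t2.values()),
        default=0.0,
    )


def rows(t: Table) -> Set[Tuple[str, ...]]:
    """Set of rows, with columns taken in sorted header order."""
    return set(zip(*(t[c] for c in sorted(t))))


def is_subset(t1: Table, t2: Table) -> bool:
    """Toy rule (assumption): same headers and every row of t1 occurs in t2."""
    return set(t1) == set(t2) and rows(t1) <= rows(t2)


# Made-up example tables.
cities_a: Table = {"city": ["Paris", "Rome"], "country": ["France", "Italy"]}
cities_b: Table = {
    "city": ["Paris", "Rome", "Oslo"],
    "country": ["France", "Italy", "Norway"],
}
population: Table = {"city": ["Paris", "Oslo"], "population": ["2100000", "700000"]}

print(is_unionable(cities_a, cities_b))             # same headers
print(is_subset(cities_a, cities_b))                # rows of cities_a are in cities_b
print(best_join_containment(population, cities_b))  # the "city" columns overlap
```

Real data lakes rarely have clean or matching headers, which is one reason the tasks are posed as a test for learned tabular models rather than for simple overlap rules like these.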
