DomainNet: Homograph Detection for Data Lake Disambiguation

Modern data lakes are deeply heterogeneous in the vocabulary used to describe data. We study a disambiguation problem in data lakes: how can we determine whether a data value that occurs more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguating data values, since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, whether a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be repurposed to compete with our method. Specifically, on a synthetic benchmark, using a domain discovery method to identify homographs yields a precision and recall of 38%, versus 69% with our method. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake, our top-200 precision is 89%.
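The bipartite-graph idea can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which centrality measure is applied, so betweenness centrality is used here as one illustrative choice. The toy table and column names and the helpers `build_bipartite_graph`, `betweenness`, and `rank_values` are hypothetical.

```python
from collections import defaultdict, deque


def build_bipartite_graph(columns):
    """Link every attribute (column) node to each distinct value it contains."""
    adj = defaultdict(set)
    for attr, values in columns.items():
        for value in set(values):
            adj[("attr", attr)].add(("val", value))
            adj[("val", value)].add(("attr", attr))
    return dict(adj)


def betweenness(adj):
    """Exact betweenness centrality for an unweighted, undirected graph.

    Runs one breadth-first search per source node and accumulates
    pair dependencies in reverse BFS order (Brandes-style accumulation).
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                        # nodes in order of BFS discovery
        preds = {v: [] for v in adj}      # shortest-path predecessors
        sigma = dict.fromkeys(adj, 0.0)   # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


def rank_values(columns):
    """Return (value, score) pairs, highest centrality first."""
    scores = betweenness(build_bipartite_graph(columns))
    values = [(name, score) for (kind, name), score in scores.items() if kind == "val"]
    return sorted(values, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy lake: "jaguar" appears as both an animal and a car brand.
toy_lake = {
    "zoo.animal": ["jaguar", "puma", "lion"],
    "habitat.species": ["jaguar", "lion", "tiger"],
    "cars.make": ["jaguar", "toyota", "ford"],
    "dealers.brand": ["toyota", "ford", "jaguar"],
}

ranking = rank_values(toy_lake)
for value, score in ranking:
    print(f"{value}: {score:.2f}")
```

In this toy lake, "jaguar" is the only value connecting the animal columns to the car columns, so every shortest path between the two groups passes through it and it ranks first. A value that occurs in a single column, such as "puma", lies on no shortest path between other nodes and scores zero. Exact betweenness takes time proportional to the number of nodes times the number of edges, so a real data lake would likely need a sampling-based approximation.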
