Paper information - DomainNet: Homograph Detection for Data Lake Disambiguation

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs yields a precision and a recall of 38% on a synthetic benchmark, versus 69% with our method. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake, our top-200 precision is 89%.
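The core idea in the abstract, scoring values by a centrality measure on a value-attribute bipartite graph, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the specific centrality measure, so betweenness centrality (computed with Brandes' algorithm) is an assumption made here for illustration. The toy `lake`, the column names, and the helper names `build_bipartite_graph`, `betweenness`, and `rank_values` are all hypothetical.

```python
from collections import defaultdict, deque

def build_bipartite_graph(columns):
    """columns: dict mapping an attribute (column) name to its cell values.
    Returns an undirected adjacency dict over ('attr', name) and ('val', v) nodes."""
    adj = defaultdict(set)
    for attr, values in columns.items():
        a = ("attr", attr)
        for v in values:
            node = ("val", v)
            adj[a].add(node)
            adj[node].add(a)
    return dict(adj)

def betweenness(adj):
    """Unnormalized betweenness centrality for an unweighted, undirected graph
    (Brandes' algorithm). Assumed here as one possible centrality measure."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                    # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

def rank_values(columns):
    """Return (value, score) pairs for value nodes, highest centrality first."""
    scores = betweenness(build_bipartite_graph(columns))
    values = [(node[1], s) for node, s in scores.items() if node[0] == "val"]
    return sorted(values, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy lake: "Jaguar" appears both as an animal and as a car make.
lake = {
    "zoo.animal": ["Jaguar", "Lion", "Tiger"],
    "wildlife.species": ["Jaguar", "Lion", "Panda"],
    "cars.make": ["Jaguar", "Toyota", "Ford"],
    "dealers.brand": ["Jaguar", "Toyota", "Honda"],
}

if __name__ == "__main__":
    for value, score in rank_values(lake):
        print(f"{value}: {score:.1f}")
```

In this toy lake, "Jaguar" is the only node connecting the animal columns to the car columns, so every shortest path between the two groups passes through it and it ranks first. Values that occur in a single column, such as "Tiger", lie on no shortest path between other nodes and score zero. A real data lake would need an approximate or sampled centrality computation, since the exact algorithm runs in time proportional to the number of nodes times the number of edges.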
