Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery

Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent group, a technique to combine word embeddings that works better than other state of the art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.

[1]  Zellig S. Harris,et al.  Distributional Structure , 1954 .

[2]  Florent Perronnin,et al.  Aggregating Continuous Word Embeddings for Information Retrieval , 2013, CVSM@ACL.

[3]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[4]  Jayant Madhavan,et al.  Recovering Semantics of Tables on the Web , 2011, Proc. VLDB Endow..

[5]  Jayant Madhavan,et al.  OpenII: an open source information integration toolkit , 2010, SIGMOD Conference.

[6]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[7]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[8]  Koby Crammer,et al.  Learning to create data-integrating queries , 2008, Proc. VLDB Endow..

[9]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[10]  Joann J. Ordille,et al.  Data integration: the teenage years , 2006, VLDB.

[11]  Kristina Lerman,et al.  Semi-automatically Mapping Structured Sources into the Semantic Web , 2012, ESWC.

[12]  Michael Stonebraker,et al.  Aurum: A Data Discovery System , 2018, 2018 IEEE 34th International Conference on Data Engineering (ICDE).

[13]  Laura M. Haas,et al.  The Clio project: managing heterogeneity , 2001, SGMD.

[14]  Bernardo Cuenca Grau,et al.  LogMap 2.0: towards logic-based, scalable and interactive ontology matching , 2011, SWAT4LS.

[15]  Diego Calvanese,et al.  Ontop: Answering SPARQL queries over relational databases , 2016, Semantic Web.

[16]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[17]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[18]  Ian Horrocks,et al.  BootOX: Practical Mapping of RDBs to OWL 2 , 2015, SEMWEB.

[19]  W. Bruce Croft,et al.  Estimating Embedding Vectors for Queries , 2016, ICTIR.

[20]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[21]  M. de Rijke,et al.  Siamese CBOW: Optimizing Word Embeddings for Sentence Representations , 2016, ACL.

[22]  Craig A. Knoblock,et al.  Learning the Semantics of Structured Data Sources , 2016, J. Web Semant..

[23]  Erhard Rahm,et al.  Evolution of the COMA match system , 2011, OM.

[24]  Erhard Rahm,et al.  A survey of approaches to automatic schema matching , 2001, The VLDB Journal.

[25]  Jan Nößner,et al.  CODI: Combinatorial Optimization for Data Integration: results for OAEI 2011 , 2010, OM.

[26]  Eneko Agirre,et al.  Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet , 2016, AAAI.

[27]  Achille Fokoue,et al.  Exploring Big Data with Helix: Finding Needles in a Big Haystack , 2015, SGMD.

[28]  Markus Nentwig,et al.  A survey of current Link Discovery frameworks , 2016, Semantic Web.

[29]  Christopher De Sa,et al.  DeepDive: Declarative Knowledge Base Construction , 2016, SGMD.

[30]  Craig A. Knoblock,et al.  Leveraging Linked Data to Discover Semantic Relations Within Data Sources , 2016, SEMWEB.

[31]  Fausto Giunchiglia,et al.  S-Match: an Algorithm and an Implementation of Semantic Matching , 2004, ESWS.

[32]  Marie-Francine Moens,et al.  Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings , 2015, SIGIR.

[33]  Thomas Demeester,et al.  Representation learning for very short texts using weighted word embedding aggregation , 2016, Pattern Recognit. Lett..

[34]  Surajit Chaudhuri,et al.  Fast Foreign-Key Detection in Microsoft SQL Server PowerPivot for Excel , 2014, Proc. VLDB Endow..

[35]  Alon Y. Halevy,et al.  Goods: Organizing Google's Datasets , 2016, SIGMOD Conference.

[36]  Christian Bizer,et al.  DBpedia: A Multilingual Cross-domain Knowledge Base , 2012, LREC.