Table Union Search on Open Data

We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they share attributes from the same domain. Our solution formalizes three statistical models that describe how unionable attributes are generated from set domains, semantic domains with values from an ontology, and natural language domains. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show that our table union search outperforms in speed and accuracy existing algorithms for finding related tables and scales to provide efficient search over Open Data repositories containing more than one million attributes.

[1]  Alon Y. Halevy,et al.  Data Integration for the Relational Web , 2009, Proc. VLDB Endow..

[2]  Philip A. Bernstein,et al.  HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching , 2009, Proc. VLDB Endow..

[3]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[4]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[5]  Reynold Xin,et al.  Finding related tables , 2012, SIGMOD Conference.

[6]  Xiaohua Hu,et al.  MaxMatcher: Biological Concept Extraction Using Approximate Dictionary Lookup , 2006, PRICAI.

[7]  Olivier Bodenreider,et al.  Comparing two approaches for aligning representations of anatomy , 2007, Artif. Intell. Medicine.

[8]  Jayant Madhavan,et al.  Recovering Semantics of Tables on the Web , 2011, Proc. VLDB Endow..

[9]  Jens Lehmann,et al.  DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia , 2015, Semantic Web.

[10]  H. Hotelling The Generalization of Student’s Ratio , 1931 .

[11]  Anand Rajaraman,et al.  Mining of Massive Datasets , 2011 .

[12]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[13]  Stephen E. Robertson,et al.  A probabilistic model of information retrieval: development and comparative experiments - Part 1 , 2000, Inf. Process. Manag..

[14]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[15]  Jeffrey F. Naughton,et al.  On schema matching with opaque column names and data values , 2003, SIGMOD '03.

[16]  Peter Willett,et al.  Estimating the recall performance of Web search engines , 1997 .

[17]  Dominique Ritze,et al.  Matching HTML Tables to DBpedia , 2015, WIMS.

[18]  Renée J. Miller,et al.  LSH Ensemble: Internet-Scale Domain Search , 2016, Proc. VLDB Endow..

[19]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[20]  Mayank Bawa,et al.  LSH forest: self-tuning indexes for similarity search , 2005, WWW '05.

[21]  W. Bruce Croft,et al.  Estimating Embedding Vectors for Queries , 2016, ICTIR.

[22]  Daisy Zhe Wang,et al.  WebTables: exploring the power of tables on the web , 2008, Proc. VLDB Endow..

[23]  Sunita Sarawagi,et al.  Annotating and searching web tables using entities, types and relationships , 2010, Proc. VLDB Endow..

[24]  Renée J. Miller,et al.  Discovering Linkage Points over Web Data , 2013, Proc. VLDB Endow..

[25]  Kevin Chen-Chuan Chang,et al.  Statistical schema matching across web query interfaces , 2003, SIGMOD '03.

[26]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[27]  J. Rice Mathematical Statistics and Data Analysis , 1988 .

[28]  Jure Leskovec,et al.  Mining of Massive Datasets, 2nd Ed , 2014 .

[29]  Christian Bizer,et al.  Stitching Web Tables for Improving Matching Quality , 2017, Proc. VLDB Endow..

[30]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[31]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[32]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[33]  Oren Kurland,et al.  Query Expansion Using Word Embeddings , 2016, CIKM.

[34]  Ann Lehman JMP for basic univariate and multivariate statistics : a step-by-step guide , 2005 .

[35]  Sunita Sarawagi,et al.  Answering Table Queries on the Web using Column Keywords , 2012, Proc. VLDB Endow..

[36]  P. Patel-Schneider Towards Large-scale Schema And Ontology Matching , 2015 .

[37]  Alon Y. Halevy,et al.  Synthesizing Union Tables from the Web , 2013, IJCAI.

[38]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[39]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .

[40]  Weifeng Su,et al.  Holistic Schema Matching for Web Query Interfaces , 2006, EDBT.

[41]  Surajit Chaudhuri,et al.  InfoGather: entity augmentation and attribute discovery by holistic matching with web tables , 2012, SIGMOD Conference.