Paper Information - Table Union Search on Open Data

Table Union Search on Open Data

We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they share attributes from the same domain. Our solution formalizes three statistical models that describe how unionable attributes are generated from set domains, semantic domains with values from an ontology, and natural language domains. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show that our table union search outperforms existing algorithms for finding related tables in both speed and accuracy, and that it scales to provide efficient search over Open Data repositories containing more than one million attributes.
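The abstract gives no formulas, so the following is only an illustrative sketch of what a statistical test for the set-domain case might look like. It assumes, as a labeled assumption and not as the paper's stated method, that two attributes are value sets drawn from a shared domain of known size, and that their unionability is scored by the hypergeometric probability of seeing an overlap no larger than the one observed. The function name `set_unionability` and the `domain_size` parameter are hypothetical.

```python
from math import comb


def set_unionability(a: set, b: set, domain_size: int) -> float:
    """Score two attributes (as value sets) by how surprising their overlap is.

    Assumption (not taken from the abstract): b is modeled as |b| values drawn
    without replacement from a domain of `domain_size` values, |a| of which
    belong to a. The score is P(X <= t), where X is hypergeometric and
    t = |a & b| is the observed overlap. Scores near 1 mean the overlap is
    larger than chance would usually produce.
    """
    if domain_size < len(a | b):
        raise ValueError("domain_size must cover all values in both attributes")
    t = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(domain_size, n_b)  # number of ways to draw b from the domain
    # Sum the hypergeometric probability mass for overlaps 0..t.
    favourable = sum(
        comb(n_a, s) * comb(domain_size - n_a, n_b - s) for s in range(t + 1)
    )
    return favourable / total


# Hypothetical attributes over an assumed domain of 10 distinct values.
high = set_unionability({1, 2, 3}, {2, 3, 4}, domain_size=10)
low = set_unionability({1, 2, 3}, {4, 5, 6}, domain_size=10)
```

Under these assumptions, the pair with two shared values scores higher than the disjoint pair. The semantic and natural language models named in the abstract would need an ontology or word embeddings and are not sketched here.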
