In a traditional relational database management system, queries can only be posed over attributes defined in the schema, but they are guaranteed to give a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result is a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this paper, we present DrillBeyond, a novel IR/RDBMS hybrid system in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over the relational database, and additionally allows the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus mixing properties of RDBMS and IR systems. We design a novel plan operator that encapsulates a web data retrieval and matching system and allows direct integration of such systems into relational query processing. We then present methods for efficiently processing multiple variants of a query by producing plans that are optimized for large invariant intermediate results, which can be reused across query evaluations. We demonstrate the viability of the operator and our optimization strategies by implementing them in PostgreSQL and evaluating them on a standard benchmark whose queries we extend with arbitrary additional attributes.
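The idea of evaluating one semi-specified query against several candidate web data sources, while reusing the part of the work that does not depend on the source, can be illustrated with a minimal Python sketch. This is not the PostgreSQL operator described in the paper: the names `Candidate` and `evaluate_top_k`, the relevance scores, the table identifiers, and all data values are hypothetical and chosen only for illustration.

```python
from typing import Callable, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical candidate web table for the open attribute."""
    source: str    # made-up identifier of the web table
    score: float   # made-up relevance score from a retrieval step
    values: dict   # entity key -> value of the open attribute


def evaluate_top_k(rows: list, predicate: Callable[[dict], bool], key: str,
                   open_attr: str, candidates: list, k: int) -> list:
    """Return one query result per top-k candidate source."""
    # Invariant part: filtering the local relation does not depend on
    # the chosen web source, so it is computed once and reused.
    invariant = [r for r in rows if predicate(r)]

    # Keep the k highest-scoring candidate sources.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]

    results = []
    for cand in ranked:
        # Variant part: augment the shared intermediate result with the
        # values from this candidate; missing entities become None.
        augmented = [{**r, open_attr: cand.values.get(r[key])}
                     for r in invariant]
        results.append((cand.source, augmented))
    return results


# Illustrative local relation and candidates (all values are invented).
nations = [
    {"name": "Germany", "region": "EUROPE"},
    {"name": "France", "region": "EUROPE"},
    {"name": "Brazil", "region": "AMERICA"},
]
candidates = [
    Candidate("web_table_a", 0.9, {"Germany": 80.0, "France": 65.0}),
    Candidate("web_table_b", 0.7, {"Germany": 81.0}),
    Candidate("web_table_c", 0.2, {"Brazil": 200.0}),
]

top2 = evaluate_top_k(nations, lambda r: r["region"] == "EUROPE",
                      key="name", open_attr="population",
                      candidates=candidates, k=2)
```

Here `population` plays the role of an attribute that is not defined in the local schema. Each entry of `top2` is one alternative evaluation of the same query, and the filtered list `invariant` stands in for the large intermediate result that is shared between the evaluations.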