Exploratory Ad-Hoc Analytics for Big Data

In a traditional relational database management system, queries can only be posed over attributes defined in the schema, but they are guaranteed to return a single, definitive answer structured exactly as specified in the query. In contrast, an information retrieval system allows the user to pose queries without knowledge of a schema, but the result is a top-k list of possible answers, with no guarantees about the structure or content of the retrieved documents. In this chapter, we present DrillBeyond, a novel IR/RDBMS hybrid system in which the user seamlessly queries a relational database together with a large corpus of tables extracted from a web crawl. The system allows full SQL queries over a relational database, but additionally enables the user to use arbitrary additional attributes in the query that need not be defined in the schema. The system then processes this semi-specified query by computing a top-k list of possible query evaluations, each based on different candidate web data sources, thus combining properties of both worlds: RDBMS and IR systems.
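To make the idea of a semi-specified query concrete, the following is a minimal toy sketch in Python, not the DrillBeyond implementation. It assumes a local table `nation` whose schema lacks the attribute `population`, and three hypothetical web tables that each supply values for it. The names `web_tables`, `coverage`, and `top_k_evaluations`, the coverage-based ranking, and all numeric values are assumptions made purely for illustration. Each candidate source yields its own evaluation of a query that could be written as `SELECT region, SUM(population) FROM nation GROUP BY region`.

```python
from typing import Dict, List, Tuple

# Local relational table: only "name" and "region" are defined in the schema.
nation: List[Dict[str, str]] = [
    {"name": "Germany", "region": "Europe"},
    {"name": "France", "region": "Europe"},
    {"name": "Japan", "region": "Asia"},
]

# Hypothetical candidate web tables for the open attribute "population".
# The values are toy placeholders, not real statistics.
web_tables: Dict[str, Dict[str, float]] = {
    "source_a": {"Germany": 10.0, "France": 20.0, "Japan": 30.0},
    "source_b": {"Germany": 12.0, "France": 20.0},
    "source_c": {"Japan": 31.0},
}


def coverage(table: Dict[str, float], rows: List[Dict[str, str]]) -> float:
    """Fraction of local rows for which the web table provides a value."""
    matched = sum(1 for row in rows if row["name"] in table)
    return matched / len(rows)


def top_k_evaluations(
    rows: List[Dict[str, str]],
    sources: Dict[str, Dict[str, float]],
    k: int,
) -> List[Tuple[str, Dict[str, float]]]:
    """Evaluate the aggregation once per candidate source, best sources first.

    Ranking by coverage (ties broken by source name) is an assumed stand-in
    for whatever relevance scoring a real system would use.
    """
    ranked = sorted(
        sources.items(),
        key=lambda item: (-coverage(item[1], rows), item[0]),
    )
    results = []
    for source_name, table in ranked[:k]:
        totals: Dict[str, float] = {}
        for row in rows:
            # Rows without a value in this source are skipped in this sketch.
            if row["name"] in table:
                region = row["region"]
                totals[region] = totals.get(region, 0.0) + table[row["name"]]
        results.append((source_name, totals))
    return results


if __name__ == "__main__":
    for source_name, totals in top_k_evaluations(nation, web_tables, k=2):
        print(source_name, totals)
```

The point of the sketch is only the shape of the output: instead of one definitive answer, the caller receives a ranked list of alternative answers, one per candidate source.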
