Sparse relational data sets: issues and an application

This dissertation comprises three parts. The first part presents a relational approach to building a workbench that supports extracting and processing structured data from unstructured data for querying, and that allows users to query using as much structured data as is currently available. The workbench provides basic operations that can be combined to process data in a "pay as you go" fashion, and a wide table to store the resulting sparse data set, in which most attributes are null for most documents. As a proof of concept, we conducted a case study on applying this relational workbench approach to support structured queries over Wikipedia. We present examples of incremental data processing with a series of operations and show that users can pose increasingly sophisticated queries over the results of these operations. Our conclusion from the case study is that while the relational workbench approach is promising and worth investigating, its success relies heavily on good relational database support for sparse data. Unfortunately, most relational database systems are not good at handling sparse data sets.

The second part of the dissertation addresses some of the challenges that arise in managing sparse data in relational database systems. Building on recent work showing that sparse data can be stored efficiently using interpreted storage, we show that storing a sparse data set in a single, wide table is an effective approach. For querying, we show that keyword search often provides "focused" results because most terms appear in few rows and columns. For query evaluation, we show that sparse B-tree indexes allow much broader index coverage over a sparse data set than the more common full B-tree indexes. Moreover, we describe how to automatically infer from a sparse data set a "hidden schema," which can serve as a reference for building materialized views, a browsing directory, or query forms over the wide table.
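To make the idea of a sparse index over a wide table concrete, the following is a minimal illustrative sketch, not the dissertation's implementation. It assumes a toy in-memory table in which each row keeps only its non-null attributes (loosely in the spirit of interpreted storage) and uses a sorted list as a stand-in for a B-tree. The table contents and the names `rows`, `build_sparse_index`, and `range_lookup` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical wide table with made-up values: each row stores only its
# non-null attributes, so absent keys play the role of nulls.
rows = {
    1: {"title": "doc A", "population": 1000},
    2: {"title": "doc B", "born": 1815},
    3: {"title": "doc C", "area": 39.4},
    4: {"title": "doc D", "born": 1912},
}


def build_sparse_index(table, attribute):
    """Return a sorted list of (value, row_id) pairs covering only the rows
    where `attribute` is non-null. A sorted list stands in for a B-tree."""
    return sorted(
        (row[attribute], row_id)
        for row_id, row in table.items()
        if attribute in row
    )


def range_lookup(index, low, high):
    """Return the row ids whose indexed value lies in [low, high]."""
    start = bisect_left(index, (low, float("-inf")))
    end = bisect_right(index, (high, float("inf")))
    return [row_id for _, row_id in index[start:end]]


born_index = build_sparse_index(rows, "born")
matches = range_lookup(born_index, 1800, 1900)
```

In this sketch the index on `born` holds two entries even though the table has four rows, because rows where the attribute is null contribute nothing. Under that assumption, the space an index needs grows with the number of non-null values rather than the number of rows, which is one plausible reading of why many more attributes of a sparse data set can be indexed within the same budget.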
The last part of the dissertation addresses the problem of building structured queries over a relational database system without knowing SQL or the database schema. We propose combining keyword search and query forms to provide the best of both. We present a systematic approach to generating forms, and propose techniques for handling keyword queries that mix database constants with terms that appear on the forms. We also consider grouping similar forms returned by keyword search to help users find the right form more easily. Our real-life user study shows that our keyword search implementations are efficient and generally effective, and that grouping forms is an important post-processing step of keyword search for leading users to the right forms.