Deeper: A Data Enrichment System Powered by Deep Web

Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming of these tasks. Existing data enrichment approaches are resource-intensive: they are data-intensive when they rely on web tables or knowledge bases, monetarily intensive when they purchase entire datasets, or time-intensive when they fully crawl a web-based data source. In this work, we explore a more targeted alternative that uses resources (measured in web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists link a local database to a hidden database so that they can easily enrich the local database with attributes from the hidden database. A key challenge is how to crawl the hidden database. This problem differs from typical deep web crawling, whose goal is to crawl the entire hidden database rather than only the content relevant to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.
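
To make the setting concrete, the following is a minimal sketch of one plausible straightforward baseline, not the crawling strategy proposed in this work. It issues one keyword query per local record, so the number of API calls equals the size of the local database. The `HiddenDatabase` class, its top-k `search` interface, the `title` and `citations` attributes, the exact-match linking rule, and all records are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HiddenDatabase:
    """Hypothetical stand-in for a deep-web source reachable only via keyword search."""
    records: list
    top_k: int = 2   # assumed limit on results returned per query
    calls: int = 0   # number of API calls issued so far

    def search(self, query: str) -> list:
        """Return at most top_k records whose title contains every query keyword."""
        self.calls += 1
        keywords = query.lower().split()
        hits = [r for r in self.records
                if all(k in r["title"].lower().split() for k in keywords)]
        return hits[: self.top_k]


def naive_enrich(local: list, hidden: HiddenDatabase, attribute: str) -> list:
    """Issue one query per local record; copy `attribute` from an exact title match."""
    enriched = []
    for row in local:
        results = hidden.search(row["title"])
        match = next((h for h in results
                      if h["title"].lower() == row["title"].lower()), None)
        enriched.append({**row, attribute: match[attribute] if match else None})
    return enriched


# Toy data, made up for this sketch.
hidden = HiddenDatabase(records=[
    {"title": "toy paper alpha", "citations": 120},
    {"title": "toy paper beta", "citations": 45},
    {"title": "toy paper gamma", "citations": 300},
])
local = [
    {"title": "toy paper alpha"},
    {"title": "toy paper beta"},
    {"title": "toy paper delta"},  # not present in the hidden database
]

result = naive_enrich(local, hidden, "citations")
```

In this sketch the cost is exactly one call per local record, and a record with no match keeps a `None` value for the new attribute. Real hidden databases typically need fuzzy matching rather than exact title equality; that step is omitted here.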