Paper Information - Deeper: A Data Enrichment System Powered by Deep Web

Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists to link a local database to a hidden database so that they can easily enrich the local database with the attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.

Ryan Shea | Pei Wang | Eugene Wu | Jiannan Wang | Yongjun He
