Sift: an end-user tool for gathering web content on the go

Although web sites have started to embed semantic metadata in their documents, non-technical end-users still find it difficult to exploit that markup to extract and store information of interest. To address this challenge, we show how to build tools that let users identify extractable information while browsing and then control how that information is extracted and stored in a personal library. The proposed approach rests on two elements: an extensible framework that can use different kinds of markup to aid extraction, and a unique combination of well-established techniques from areas such as the semantic web, data warehousing, web scraping and web feeds. We present Sift, a proof-of-concept implementation of the approach.
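To make the idea of an extensible, markup-driven extraction framework more concrete, the following is a minimal sketch and not Sift's actual implementation. It assumes two kinds of embedded markup (HTML microdata `itemprop` attributes and JSON-LD script blocks) and a hypothetical registry, `EXTRACTORS`, to which further extractors could be added. All names here (`register`, `extract_all`, the parser classes) are illustrative assumptions, and the microdata handling is deliberately simplified to flat, non-nested properties.

```python
import json
from html.parser import HTMLParser
from typing import Callable, Dict, List

Record = Dict[str, str]
Extractor = Callable[[str], List[Record]]

# Hypothetical registry: one extractor per kind of markup.
EXTRACTORS: Dict[str, Extractor] = {}

def register(kind: str):
    """Decorator that adds an extractor to the registry under `kind`."""
    def wrap(fn: Extractor) -> Extractor:
        EXTRACTORS[kind] = fn
        return fn
    return wrap

class _MicrodataParser(HTMLParser):
    """Collects itemprop values into one flat record (no nesting)."""
    def __init__(self) -> None:
        super().__init__()
        self.record: Record = {}
        self._prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:          # e.g. <meta itemprop=... content=...>
            self.record[prop] = attrs["content"] or ""
        else:                           # value is the element's text
            self._prop = prop
            self.record[prop] = ""

    def handle_data(self, data):
        if self._prop is not None:
            self.record[self._prop] += data

    def handle_endtag(self, tag):
        self._prop = None

class _JsonLdParser(HTMLParser):
    """Collects the raw text of application/ld+json script blocks."""
    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

@register("microdata")
def extract_microdata(html: str) -> List[Record]:
    parser = _MicrodataParser()
    parser.feed(html)
    record = {k: v.strip() for k, v in parser.record.items()}
    return [record] if record else []

@register("json-ld")
def extract_json_ld(html: str) -> List[Record]:
    parser = _JsonLdParser()
    parser.feed(html)
    records: List[Record] = []
    for block in parser.blocks:
        data = json.loads(block)
        if isinstance(data, dict):
            # Keep only top-level string fields in this sketch.
            records.append({k: v for k, v in data.items() if isinstance(v, str)})
    return records

def extract_all(html: str) -> Dict[str, List[Record]]:
    """Run every registered extractor and group the records by markup kind."""
    return {kind: fn(html) for kind, fn in EXTRACTORS.items()}
```

Supporting another kind of markup in this sketch would only require registering one more function, which is the sense in which such a framework is extensible. A real tool would also need user control over which records are kept and how they are stored in the personal library.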
