Feeding the world: a comprehensive dataset and analysis of a real world snapshot of web feeds

Web feeds allow users to retrieve new content from pages on the World Wide Web. Feeds are offered by a multitude of web pages, ranging from conventional news sites to pages with user-generated content such as wikis, forums, or personal blogs. They notify interested readers of new content and are therefore interesting for information retrieval tasks. Unfortunately, no comprehensive dataset of feeds is publicly available, which makes it difficult for researchers to work with this kind of data and, more importantly, to compare their results on a common dataset. In this work we present an extensive real-world dataset of 200,000 diversified feeds, together with an analysis of it. The dataset was collected over a period of four weeks, yielding over 54 million entries and 100 GB of compressed data. One important outcome of the analysis is that feeds show different activity patterns, which aggregators such as feed reader software should take into account to improve their polling strategies. The dataset has been made publicly available for use by research communities around the world.
