Efficient Monitoring Algorithm for Fast News Alerts

Recently, there has been a dramatic increase in the use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums now publish RSS feeds from which their subscribers retrieve new postings. Subscribers rely on news feeders to regularly pull articles from these Web sites, and the aggregate traffic from all news feeders puts an enormous load on many sites. To alleviate this problem, we propose a blog aggregator approach in which a central aggregator monitors the data sources, retrieves new postings, and then disseminates them to the subscribers. We study how the blog aggregator should monitor the data sources so that it retrieves new postings quickly with minimal resources and provides its subscribers with fast news alerts. Our studies on a collection of 10K RSS feeds show that, with proper resource allocation and scheduling, the blog aggregator delivers news 50% faster than the best existing approach and also significantly reduces the load on the monitored data sources.
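To make the resource-allocation question concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes, purely for illustration, that each feed publishes at a constant average rate, that a feed polled f times per day is polled at even intervals, and that a new posting therefore waits half a polling interval on average before it is retrieved. Under those assumptions, minimizing the total delay for a fixed polling budget gives poll frequencies proportional to the square root of each feed's posting rate. The function names, the rates, and the budget are all hypothetical.

```python
import math


def allocate_polls(post_rates, budget):
    """Split `budget` polls per day across feeds in proportion to sqrt(rate).

    Illustrative assumption: constant posting rates and evenly spaced polls.
    """
    weights = [math.sqrt(rate) for rate in post_rates]
    total = sum(weights)
    return [budget * w / total for w in weights]


def total_expected_delay(post_rates, poll_freqs):
    """Total expected delay accumulated per day, summed over all feeds.

    A feed polled f times per day has interval 1/f, so each of its postings
    waits 1/(2f) days on average; with `rate` postings per day the feed
    contributes rate / (2f). Feeds with no postings are skipped.
    """
    return sum(r / (2.0 * f) for r, f in zip(post_rates, poll_freqs) if r > 0)


# Hypothetical posting rates (postings per day) and polling budget (polls per day).
rates = [1.0, 4.0, 16.0]
budget = 14.0

sqrt_alloc = allocate_polls(rates, budget)
uniform_alloc = [budget / len(rates)] * len(rates)

delay_sqrt = total_expected_delay(rates, sqrt_alloc)
delay_uniform = total_expected_delay(rates, uniform_alloc)

print("sqrt allocation:", sqrt_alloc, "delay:", delay_sqrt)
print("uniform allocation:", uniform_alloc, "delay:", delay_uniform)
```

In this toy setting the square-root allocation gives a lower total delay than spreading the same budget evenly across the three feeds. Real feeds do not post at a constant rate, so when to poll within the day matters as well as how often; that scheduling question is outside the scope of this sketch.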

[22] Hyun-Kyu Cho, et al. Efficient Monitoring Algorithm for Fast News Alerts, 2007, IEEE Transactions on Knowledge and Data Engineering.