Maintaining dynamic channel profiles on the web

This work addresses the novel problem of maintaining channel profiles on the Web. Such channel maintenance is essential for the next generation of Web 2.0 applications that provide sophisticated search and discovery services over Web information channels. Maintaining a fresh channel profile is extremely difficult due to the dynamic nature of the channel, especially under the constraint of a limited monitoring budget. We propose a novel monitoring scheme that learns the channels' monitoring rates. The scheme is further extended to consider the content that is published on the channels. We describe a novelty detection filter that refines the monitoring rate according to the expected rate of novel content published on the channels. We further show how inter-channel profile similarities can be used to refine the channel monitoring rates. Using real-world data from Web feeds, we study the performance of the monitoring scheme. We experiment with several monitoring policies over a large set of Web feeds and show that a policy based on learning the monitoring rate of the channels, combined with novelty detection, outperforms alternative channel monitoring policies. Our results show that the suggested content-based policy is able to maintain high-quality channel profiles under limited monitoring resources.
