Indexes Analysis for Matching Subscriptions in RSS feeds

The explosion of published information on the Web has led to the emergence of a Web syndication paradigm, which transforms the passive reader into an active information collector. Information consumers subscribe to RSS/Atom feeds and are notified whenever a piece of news (an item) is published. The success of Web syndication, now offered on Web sites, blogs, and social media, raises scalability issues, however. There is a vital need for efficient real-time filtering methods across feeds so that users can effectively follow the information that interests them. In this paper we investigate three indexing techniques for users' subscriptions, based on inverted lists or on an ordered trie. We present analytical models for memory requirements and matching time, and we conduct a thorough experimental evaluation to show the impact of critical workload parameters on these structures.
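
To make the matching problem concrete, the sketch below shows one common way an inverted-list index over subscriptions can be queried: the counting approach. It is an illustration only and is not claimed to be any of the three structures studied in the paper. It assumes that a subscription is a set of keywords and that an item matches when it contains all of them; the class name `CountingIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class CountingIndex:
    """Illustrative inverted-list index over keyword subscriptions.

    Assumption (not taken from the paper): a subscription is a set of
    terms, and it matches an item that contains every one of them.
    Subscription ids are assumed to be unique.
    """

    def __init__(self):
        self.postings = defaultdict(list)  # term -> ids of subscriptions using it
        self.sizes = {}                    # subscription id -> number of distinct terms

    def add(self, sub_id, terms):
        unique_terms = set(terms)
        if not unique_terms:
            raise ValueError("a subscription needs at least one term")
        self.sizes[sub_id] = len(unique_terms)
        for term in unique_terms:
            self.postings[term].append(sub_id)

    def match(self, item_terms):
        # Count, for each subscription, how many of its terms occur in the item.
        counts = defaultdict(int)
        for term in set(item_terms):
            for sub_id in self.postings.get(term, ()):
                counts[sub_id] += 1
        # A subscription matches when all of its terms were seen.
        return sorted(s for s, c in counts.items() if c == self.sizes[s])
```

With this sketch, memory grows with the total number of (term, subscription) pairs, and matching time grows with the lengths of the posting lists of the terms in the incoming item. These are the kinds of quantities that analytical models of memory and matching time would describe.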
