A distributed full-text top-k document dissemination system in distributed hash tables

Recent years have witnessed the explosive growth of ‘live’ web content on the World Wide Web, such as weblogs, RSS feeds, and real-time news. The popularity of RSS feeds and readers lets end users subscribe to their favorite content by entering RSS URLs. However, the RSS feed/reader architecture suffers from (i) high bandwidth consumption and (ii) limited filtering semantics. In this paper, we propose a stateful full-text dissemination scheme over structured P2P networks to address both issues. On the semantic side, end users can subscribe to their favorite content by entering keywords; on the bandwidth side, cooperative content polling, filtering, and dissemination over DHT-based P2P overlay networks reduce network bandwidth consumption. Our contributions include novel techniques to (i) reduce the unit publishing cost by pruning irrelevant documents along the forwarding path towards their destinations, and (ii) reduce the number of publications by selecting a very small number of meaningful terms. Experimental results on real data sets show that the proposed scheme significantly reduces the publishing cost while keeping maintenance overhead low and document quality high.
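The two cost-reduction ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the term-weighting function, the hash scheme, or the pruning rule, so the sketch assumes raw term frequency as the weight, a SHA-1 hash modulo the node count as the DHT key mapping, and a per-subscription top-k score threshold as the pruning test. All names (`select_terms`, `home_node`, `TopKFilter`) are hypothetical.

```python
import hashlib
import heapq
from collections import Counter


def select_terms(tokens, m):
    """Pick the m most frequent terms of a document (assumed weighting).

    Ties are broken alphabetically so the result is deterministic.
    Publishing under only these m terms limits the number of publications.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]


def home_node(term, num_nodes):
    """Map a term to a node id by hashing, the way a DHT maps keys to peers.

    SHA-1 modulo num_nodes is an assumption; real DHTs use their own key space.
    """
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


class TopKFilter:
    """Tracks the k best document scores seen so far for one subscription."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap holding the current top-k scores

    def offer(self, score):
        """Return True if the document enters the top-k, False if it is pruned."""
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, score)
            return True
        if score > self.heap[0]:
            # The new score beats the weakest of the current top-k.
            heapq.heapreplace(self.heap, score)
            return True
        return False


tokens = "p2p overlay p2p filtering overlay p2p news".split()
terms = select_terms(tokens, 2)
targets = {term: home_node(term, 16) for term in terms}

top2 = TopKFilter(2)
decisions = [top2.offer(s) for s in (0.5, 0.9, 0.1, 0.7)]
```

In this sketch a forwarding node that holds a `TopKFilter` can drop a document as soon as `offer` returns `False`, so low-scoring documents stop consuming bandwidth before they reach the subscriber.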
