A Self-Organizing Search Engine for RSS Syndicated Web Contents

The exponentially growing body of information published on the Web currently reaches the public largely through a few major search engines such as Google. This raises two questions: (1) what percentage of the content shared on the Internet do these search engines actually cover, and (2) how easy is it to find less popular content through their page-ranking systems? The increasingly dynamic nature of the information distributed on the Internet challenges the flexibility of these centralized search engines. As the amount of structured and semi-structured data on the Internet increases, self-organizing search engines that can provide sufficient coverage of data following a known structure become more and more attractive. In this paper, we propose soSpace, a self-organizing search engine for RSS-syndicated web data. soSpace is built on structured peer-to-peer technology. It enables indexing and searching of frequently updated web information described by RSS feeds. Our experimental results show that it scales well as the amount of content increases. The recall and precision of the search results are satisfactory as well.
