Efficient parallelised search engine based on virtual cluster

Recently, more and more researches have indicated that the personalised and parallelised search engine can provide users with fast and correct information from the internet. Hadoop is a software framework to process the huge dataset with more than petabyte size. Virtualisation technology can fully utilise the resources of physical machines. In this paper, we construct a virtual cluster as a Hadoop cluster by multiple virtual machines to perform multiple Nutch simultaneously. From the experimental results, the proposed virtual cluster architecture for Nutch can retrieval data rapidly and the performance enhancement is proportional to the number of virtual machines.

[1]  Zhang Kui,et al.  Research on Building an Open Access Search Engine with Nutch , 2011 .

[2]  Malgorzata Steinder,et al.  Performance-driven task co-scheduling for MapReduce environments , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.

[3]  Maged M. Michael,et al.  Scalability of the Nutch search engine , 2007, ICS '07.

[4]  Adam Rifkin,et al.  Nutch: A Flexible and Scalable Open-Source Web Search Engine , 2005 .

[5]  Michael J. Cafarella,et al.  Building Nutch: Open Source Search , 2004, ACM Queue.

[6]  Howard Gobioff,et al.  The Google file system , 2003, SOSP '03.

[7]  Maged M. Michael,et al.  Scale-up x Scale-out: A Case Study using Nutch/Lucene , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[8]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.