Architecture of a grid-enabled Web search engine

Search Engine for South-East Europe (SE4SEE) is a socio-cultural search engine running on a grid infrastructure. It offers a personalized, on-demand, country-specific, category-based Web search facility. The main goal of SE4SEE is to address the page freshness problem by performing the search on the original pages residing on the Web, rather than on previously fetched copies, as traditional search engines do. SE4SEE also aims to achieve high download rates in Web crawling by exploiting the geographically distributed nature of the grid. In this work, we present the architectural design issues and implementation details of this search engine. We report experiments that illustrate the performance obtained on a grid infrastructure and justify the search strategy employed in SE4SEE.
