Inverted Index Partitioning Strategies for a Distributed Search Engine

One of the greatest challenges in information retrieval is to develop an intelligent system for user and machine interaction that supports users in their quest for relevant information. The dramatic increase in the amount of Web content creates the need for a large-scale distributed information retrieval system that can support millions of users and terabytes of data. To retrieve information from such a large amount of data efficiently, the index is split among the servers of the distributed system. How the index is partitioned among these collaborating nodes therefore plays an important role in the performance of a distributed search engine. The two widely known inverted index partitioning schemes for a distributed information retrieval system are document partitioning and term partitioning. In this thesis, we introduce the Document over Term inverted index distribution scheme, which splits a set of nodes into several groups (sub-clusters) and then performs document partitioning between the groups and term partitioning within each group. Because this approach combines the term and document index partitioning approaches, we also refer to it as a Hybrid Inverted Index. It retains the disk access benefits of term partitioning together with the shared computational load, scalability, maintainability, and availability of document partitioning. We also introduce the Document over Document index partitioning scheme, which is based on the document partitioning approach. In this scheme, a set of nodes is split into groups, and the documents in the collection are partitioned between groups and again within each group. This strategy retains all the benefits of document partitioning, but reduces the computational load more effectively and uses resources more efficiently.
We compare the distributed index approaches experimentally and show that, in terms of efficiency and scalability, approaches based on document partitioning perform significantly better than the others. Document over Term partitioning uses search servers efficiently and lowers disk access, but suffers from load imbalance. Document over Document partitioning emerged as the preferred method under high workload.
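
The two placement rules described above can be illustrated with a minimal sketch. It assumes integer document identifiers, round-robin assignment of documents to groups, and a CRC32 hash to place terms on nodes within a group. These choices, and the function names, are hypothetical and are not taken from the thesis, which may assign documents and terms differently.

```python
from collections import defaultdict
from zlib import crc32


def stable_hash(key: str) -> int:
    """Deterministic hash (assumption: CRC32 is good enough for a sketch)."""
    return crc32(key.encode("utf-8"))


def build_document_over_term(docs, num_groups, nodes_per_group):
    """Document partitioning between groups, term partitioning within a group.

    docs maps an integer doc id to its list of terms.
    Returns {(group, node): {term: [doc ids]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        group = doc_id % num_groups  # document partitioning between groups
        for term in sorted(set(docs[doc_id])):
            node = stable_hash(term) % nodes_per_group  # term partitioning within the group
            index[(group, node)][term].append(doc_id)
    return {key: dict(postings) for key, postings in index.items()}


def build_document_over_document(docs, num_groups, nodes_per_group):
    """Document partitioning between groups and again within each group.

    Returns {(group, node): [doc ids]}; each node would index its own documents in full.
    """
    placement = defaultdict(list)
    for doc_id in sorted(docs):
        group = doc_id % num_groups
        node = (doc_id // num_groups) % nodes_per_group
        placement[(group, node)].append(doc_id)
    return dict(placement)


def nodes_for_query(terms, num_groups, nodes_per_group):
    """Nodes a query must contact under Document over Term partitioning."""
    return sorted(
        {(g, stable_hash(t) % nodes_per_group) for g in range(num_groups) for t in terms}
    )
```

In this sketch, a single-term query under Document over Term partitioning contacts one node per group, whereas under Document over Document partitioning every node in every group may hold matching documents and would have to be queried.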