A term-based inverted index partitioning model for efficient distributed query processing

In a shared-nothing, distributed text retrieval system, queries are processed over an inverted index that is partitioned among a number of index servers. In practice, the index is partitioned either by document or by term. The choice depends on the properties of the underlying hardware infrastructure, the query traffic distribution, and performance and availability constraints. In retrieval systems that adopt a term-based index partitioning strategy, the high communication overhead caused by transferring large amounts of data from the index servers forms a major performance bottleneck and limits the scalability of the entire distributed retrieval system. In this work, to alleviate this problem, we propose a novel inverted index partitioning model that relies on hypergraph partitioning. In the proposed model, concurrently accessed index entries are assigned to the same index servers, based on the inverted index access patterns extracted from past query logs. The model aims to minimize the communication overhead that will be incurred by future queries while maintaining the computational load balance among the index servers. We evaluate the performance of the proposed model through extensive experiments using a real-life text collection and a search query sample. Our results show that considerable performance gains can be achieved relative to the term-based index partitioning strategies previously proposed in the literature. In most cases, however, the performance remains inferior to that attained by document-based partitioning.
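
The idea of co-locating concurrently accessed index entries can be illustrated with a small sketch. It treats each term as a vertex and each logged query as a net (hyperedge) over its terms, then scores a term-to-server assignment by the connectivity-minus-one metric commonly used in hypergraph partitioning. The abstract does not specify the exact hypergraph formulation, cost function, or load model, so everything here is an assumption made for illustration: the function names `connectivity_cost` and `server_loads`, the toy query log, and the use of posting list length as a proxy for computational load.

```python
from collections import defaultdict


def connectivity_cost(queries, assignment):
    """Sum over queries of (distinct servers holding the query's terms - 1).

    Assumed proxy for communication overhead: a query whose terms all live
    on one server contributes 0; every extra server adds 1.
    """
    cost = 0
    for terms in queries:
        servers = {assignment[t] for t in terms}
        cost += len(servers) - 1
    return cost


def server_loads(queries, assignment, posting_len):
    """Per-server load, assuming a server pays the posting list length of a
    term each time a logged query accesses that term."""
    loads = defaultdict(int)
    for terms in queries:
        for t in terms:
            loads[assignment[t]] += posting_len[t]
    return dict(loads)


# Hypothetical query log: each query is the set of terms it accesses.
queries = [
    {"hypergraph", "partitioning"},
    {"hypergraph", "partitioning", "model"},
    {"inverted", "index"},
    {"inverted", "index", "model"},
]

# Hypothetical posting list lengths (illustrative numbers only).
posting_len = {
    "hypergraph": 3,
    "partitioning": 5,
    "model": 8,
    "inverted": 4,
    "index": 6,
}

# Two candidate assignments of terms to two index servers (0 and 1).
scattered = {"hypergraph": 0, "partitioning": 1, "model": 0,
             "inverted": 1, "index": 0}
co_located = {"hypergraph": 0, "partitioning": 0, "model": 1,
              "inverted": 1, "index": 1}

if __name__ == "__main__":
    for name, assignment in [("scattered", scattered),
                             ("co_located", co_located)]:
        print(name,
              connectivity_cost(queries, assignment),
              server_loads(queries, assignment, posting_len))
```

On this toy log, the co-located assignment lowers the connectivity cost from 4 to 1 but leaves the two servers with unequal loads (16 versus 36 under the assumed load model). This is the tension the model addresses: reducing communication while keeping the computational load balanced. A hypergraph partitioner would search for an assignment that minimizes the first quantity subject to a bound on the second.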
