An Extensible Search Engine Platform for Efficiency Research

Many widely used open-source projects for full-text search are available for industrial and academic use. Although they implement many optimized strategies and techniques to improve the performance of specific phases of text search, only a few are designed as a complete and extensible search engine platform. For efficiency research in particular, many previous works that exploit parallel mechanisms in modern computer systems, such as SIMD and GPUs, cannot be directly integrated into most existing open-source projects because those projects are written in Java. To address this, we extend NBLucene, which is written in C++, for efficiency research. We build an extensible platform with application programming interfaces that let users implement algorithms and strategies readily. The complete platform consists of three typical components: a web server, an index server, and a document server. Each component is designed to be extensible so that it can accommodate new methods without significant development effort. We also extend the original project to exploit GPU parallelism to speed up list intersection during query processing.
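
To make the list-intersection step concrete, the following is a minimal C++ sketch, not NBLucene's actual code or API. The function name `intersect_sorted` and the use of plain `std::vector` posting lists are assumptions made for illustration. Each docID of the shorter list is looked up independently in the longer list by binary search; this per-element independence is the kind of structure a GPU implementation could map to one thread per element, although here the loop simply runs sequentially on the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of NBLucene): intersect two posting lists
// of docIDs. Both inputs are assumed to be sorted in ascending order and
// free of duplicates.
std::vector<std::uint32_t> intersect_sorted(const std::vector<std::uint32_t>& a,
                                            const std::vector<std::uint32_t>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    // Probe with the shorter list so fewer binary searches are needed.
    const auto& shorter = a.size() <= b.size() ? a : b;
    const auto& longer  = a.size() <= b.size() ? b : a;

    std::vector<std::uint32_t> out;
    out.reserve(shorter.size());

    // Each iteration is independent of the others; a GPU version could
    // assign one iteration to one thread (sequential here).
    for (std::uint32_t id : shorter) {
        if (std::binary_search(longer.begin(), longer.end(), id)) {
            out.push_back(id);
        }
    }
    return out;  // still sorted, because `shorter` was sorted
}
```

For two lists of lengths m and n with m ≤ n, this sketch performs m binary searches, so it costs O(m log n) comparisons. That favors queries in which one term is much rarer than the other.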
