Anserini: Enabling the Use of Lucene for Information Retrieval Research

Software toolkits play an essential role in information retrieval research. Most open-source toolkits developed by academics are designed to facilitate the evaluation of retrieval models over standard test collections. Efforts are generally directed toward better ranking, and less attention is usually given to scalability and other operational considerations. On the other hand, Lucene has become the de facto platform in industry for building search applications (outside a small number of companies that deploy custom infrastructure). Compared to academic IR toolkits, Lucene can handle heterogeneous web collections at scale, but lacks systematic support for evaluation over standard test collections. This paper introduces Anserini, a new information retrieval toolkit that aims to provide the best of both worlds, better aligning information retrieval practice and research. Anserini provides wrappers and extensions on top of core Lucene libraries that allow researchers to use more intuitive APIs to accomplish common research tasks. Our initial efforts have focused on three functionalities: scalable, multi-threaded inverted indexing to handle modern web-scale collections; streamlined IR evaluation for ad hoc retrieval on standard test collections; and an extensible architecture for multi-stage ranking. Anserini ships with support for many TREC test collections, providing a convenient way to replicate competitive baselines right out of the box. Experiments verify that our system is both efficient and effective, providing a solid foundation to support future research.
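
To make "wrappers and extensions on top of core Lucene libraries" concrete, the sketch below shows the raw Lucene calls a researcher would otherwise write by hand to index a toy collection and rank its documents with BM25. This is a minimal illustration of the boilerplate that Anserini packages behind simpler entry points (adding multi-threaded indexing, parsers for TREC collections, and run files that can be scored with standard evaluation tools); the class name, field names ("id", "contents"), sample documents, and BM25 parameters here are illustrative assumptions, not Anserini's actual API.

    // Minimal sketch (assumed names, not Anserini's API): index two toy
    // documents with core Lucene and rank them with BM25.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class LuceneSketch {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));

        // Index two toy documents; Anserini's indexing driver performs this
        // step at scale, in parallel across threads, over TREC collections.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        String[][] docs = {
            {"doc1", "black bear attacks in national parks"},
            {"doc2", "lucene is a library for building search applications"}
        };
        for (String[] d : docs) {
          Document doc = new Document();
          doc.add(new StringField("id", d[0], Field.Store.YES));    // docid, stored for run output
          doc.add(new TextField("contents", d[1], Field.Store.NO)); // analyzed body text
          writer.addDocument(doc);
        }
        writer.close();

        // Retrieve with BM25; Anserini's retrieval driver wraps this step and
        // writes results in TREC run format for evaluation.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
        Query query = new QueryParser("contents", analyzer).parse("black bear attacks");
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
          System.out.println(searcher.doc(sd.doc).get("id") + " " + sd.score);
        }
      }
    }

In Anserini, the analogous steps are driven from the command line over full web-scale collections, with indexing parallelized across threads and retrieval output written in a form that standard ad hoc evaluation tooling can score directly.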
