Neural Embedding-Based Metrics for Pre-retrieval Query Performance Prediction

Query Performance Prediction (QPP) estimates the effectiveness of a query under a given retrieval model. It enables operations such as query routing and segmentation, which can improve retrieval performance. Pre-retrieval QPP methods predict query difficulty before observing the set of documents retrieved for the query, and so they are oblivious to the performance of the retrieval model. Since neural embedding-based models are seeing wider adoption in the Information Retrieval (IR) community, we propose a set of pre-retrieval QPP metrics based on the properties of pre-trained neural embeddings and show that these metrics are more effective for query performance prediction than widely known QPP metrics such as SCQ, PMI and SCS. We report our findings on the Robust04, ClueWeb09 and Gov2 corpora and their associated TREC topics.
