Improving the Reliability of Query Expansion for User-Generated Speech Retrieval Using Query Performance Prediction

High variability in content and structure, combined with transcription errors, makes effective information retrieval (IR) from archives of spoken user-generated content (UGC) very challenging. Previous research has shown that using passage-level evidence for query expansion (QE) in IR can improve search effectiveness. Our investigation of passage-level QE for a large Internet collection of UGC demonstrates that, while it is effective for this task, the informal and variable nature of UGC means that different queries respond better to different types of passages or, in some cases, to whole documents rather than extracted passages. We investigate the use of Query Performance Prediction (QPP) to select the appropriate passage type for each query, and introduce Weighted Expansion Gain (WEG) as a new QPP method. Our experimental investigation, using an extended ad hoc search task based on the MediaEval 2012 Search task, shows the superiority of our proposed adaptive QE approach for retrieval. A per-query evaluation of passage and full-document evidence for QE demonstrates the effectiveness of this method in the inconsistent, uncertain setting of UGC retrieval.
