Document Compaction for Efficient Query Biased Snippet Generation

Current web search engines return query-biased snippets for each document they list in a result set. For efficiency, search engines operating on large collections need to cache snippets for common queries, and to cache documents to allow fast generation of snippets for uncached queries. To improve the hit rate on a document cache during snippet generation, we propose and evaluate several schemes for reducing document size, hence increasing the number of documents in the cache. In particular, we argue against further improvements to document compression, and argue for schemes that prune documents based on the a priori likelihood that a sentence will be used as part of a snippet for a given document. Our experiments show that if documents are reduced to less than half their original size, 80% of snippets generated are identical to those generated from the original documents. Moreover, as the pruned, compressed surrogates are smaller, 3-4 times as many documents can be cached.
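The pruning idea can be illustrated with a minimal sketch. It assumes a naive sentence splitter and a stand-in, query-independent score (the mean within-document term frequency of a sentence's words); the abstract does not specify the scoring schemes actually evaluated, so `split_sentences`, `prune_document`, and `keep_fraction` are hypothetical names chosen for illustration only.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def prune_document(text, keep_fraction=0.5):
    """Keep the highest-scoring sentences within a size budget.

    The score is a placeholder for an a priori estimate of how likely a
    sentence is to appear in a snippet; it is not the paper's method.
    """
    sentences = split_sentences(text)
    if not sentences:
        return ""

    # Term frequencies over the whole document, used as a query-independent signal.
    words = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    tf = Counter(w for ws in words for w in ws)

    def score(i):
        ws = words[i]
        return sum(tf[w] for w in ws) / len(ws) if ws else 0.0

    # Budget in characters, relative to the original document size.
    budget = keep_fraction * len(text)

    # Greedily take sentences from best to worst; ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(i), i))
    kept, used = set(), 0
    for i in ranked:
        cost = len(sentences[i])
        if used + cost <= budget:
            kept.add(i)
            used += cost

    # Emit surviving sentences in their original order.
    return " ".join(sentences[i] for i in sorted(kept))
```

The pruned surrogate would then be compressed and cached in place of the full document, so that snippet generation for an uncached query reads only the retained sentences. Keeping the sentences in document order means a snippet built from the surrogate can match one built from the original whenever the sentences it needs survive pruning.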
