Fast generation of result snippets in web search

Search engine users have come to expect query-biased document snippets on the results pages that search engines present. In this paper we explore the algorithms and data structures a search engine requires to generate query-biased snippets efficiently. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for RAM caches of different sizes. Finally, we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed-size cache.
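To illustrate why compaction matters for a fixed-size cache, the following is a minimal sketch of a cache simulation. It is not the simulation used in the paper: the least-recently-used (LRU) replacement policy, the function name `simulate_lru_hits`, and the toy request stream and document sizes are all assumptions made for illustration.

```python
from collections import OrderedDict


def simulate_lru_hits(requests, doc_sizes, capacity_bytes):
    """Replay document requests against a byte-limited LRU cache.

    Hypothetical helper: the LRU policy is an assumption, not taken
    from the paper. Returns the number of cache hits.
    """
    cache = OrderedDict()  # doc_id -> size, least recently used first
    used = 0
    hits = 0
    for doc_id in requests:
        if doc_id in cache:
            hits += 1
            cache.move_to_end(doc_id)  # mark as most recently used
            continue
        size = doc_sizes[doc_id]
        if size > capacity_bytes:
            continue  # document can never fit, so skip caching it
        # Evict least recently used documents until the new one fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[doc_id] = size
        used += size
    return hits


# Toy data (made up): three documents requested in a repeating cycle.
requests = [1, 2, 3, 1, 2, 3, 1, 2, 3]
full_sizes = {1: 100, 2: 100, 3: 100}
# Assume compaction halves each document's stored size.
compact_sizes = {doc: size // 2 for doc, size in full_sizes.items()}

hits_full = simulate_lru_hits(requests, full_sizes, capacity_bytes=200)
hits_compact = simulate_lru_hits(requests, compact_sizes, capacity_bytes=200)
print(hits_full, hits_compact)
```

In this toy setting the uncompacted documents thrash the cache, because only two of the three fit at once, whereas the compacted documents all fit and every repeat request is a hit. Real query logs and document collections would behave less starkly.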
