Extracting Relevant Snippets from Web Documents through Language Model based Text Segmentation

Extracting a query-oriented snippet (or passage) and highlighting the relevant information in a long document can help reduce the result navigation cost of end users. While the traditional approach of highlighting matching keywords helps when the search is keyword oriented, finding appropriate snippets to represent matches to more complex queries requires novel techniques that can succinctly characterize the relevance of various parts of a document to the given query. In this paper, we present a language-model-based method for accurately detecting the most relevant passages of a given document. Unlike previous work in passage retrieval, which focuses on searching relevance nodes for filtering of preoccupied passages, we focus on query-informed segmentation for snippet extraction. The algorithms presented in this paper are currently being deployed in OASIS, a system to help reduce the navigational load of blind users in accessing Web-based digital libraries.
