Generating and Retrieving Text Segments for Focused Access to Scientific Documents

When presented with a retrieved document, users of a search engine are usually left with the task of pinning down the relevant information inside the document. Often this is done by a time-consuming combination of skimming, scrolling and Ctrl+F. In a digital library for scientific literature the problem is especially pressing for reference works, such as surveys and handbooks, because these typically contain long documents. Our aim is to develop methods for providing a “go-read-here” type of retrieval functionality, which points the user to a segment where she can best start reading to find out about her topic of interest. We examine multiple query-independent ways of segmenting texts into coherent chunks that can be returned in response to a query. Most (experienced) authors use paragraph breaks to indicate topic shifts, which gives us one way of segmenting documents. We compare this structural method with semantic text segmentation methods, with respect to both topical focus and relevance. Our experimental evidence is based on manually segmented scientific documents and a set of queries against this corpus. Structural segmentation based on contiguous blocks of relevant paragraphs is shown to be a viable solution for our intended application of providing “go-read-here” functionality.
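
As a rough illustration of the structural approach described above, the following Python sketch splits a text at paragraph breaks, merges adjacent relevant paragraphs into contiguous blocks, and returns the paragraph index where a reader might start. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, relevance is approximated by simple query-term overlap, and the longest block is chosen as the entry point.

```python
import re


def split_paragraphs(text):
    """Split on blank lines; paragraph breaks serve as segment boundaries."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def is_relevant(paragraph, query):
    """Assumed relevance test: at least one query term occurs in the paragraph."""
    words = set(re.findall(r"\w+", paragraph.lower()))
    return any(term in words for term in query.lower().split())


def relevant_blocks(paragraphs, query):
    """Merge runs of adjacent relevant paragraphs into (start, end) index pairs."""
    blocks, start = [], None
    for i, paragraph in enumerate(paragraphs):
        if is_relevant(paragraph, query):
            if start is None:
                start = i
        elif start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(paragraphs) - 1))
    return blocks


def go_read_here(text, query):
    """Return the first paragraph index of the longest relevant block, or None."""
    blocks = relevant_blocks(split_paragraphs(text), query)
    if not blocks:
        return None
    start, _end = max(blocks, key=lambda block: block[1] - block[0])
    return start
```

Calling `go_read_here(document_text, "text segmentation")` would return a zero-based paragraph index, or `None` when no paragraph matches. A real system would replace `is_relevant` with a proper retrieval score; the block-merging step is independent of that choice.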

[1]  V. Dijk,et al.  Some aspects of text grammars : a study in theoretical linguistics and poetics , 1972 .

[2]  Mehmet Yetis Review of: Lesk, Michael. Understanding digital libraries. 2nd. ed.. San Francisco, CA: Morgan Kaufmann, 2004 , 2005, Inf. Res..

[3]  Michael Lesk Understanding Digital Libraries , 2004 .

[4]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[5]  V. Dijk,et al.  Some Aspects Of Text Grammars , 1972 .

[6]  Gerard Salton,et al.  Automatic text decomposition using text segments and text themes , 1996, HYPERTEXT '96.

[7]  W. Bruce Croft,et al.  Text Segmentation by Topic , 1997, ECDL.

[8]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[9]  Martha Alice Hearst Context and structure in automated full-text information access , 1994 .

[10]  M.G. Bellanger,et al.  Digital processing of speech signals , 1980, Proceedings of the IEEE.

[11]  Gerard Salton,et al.  Automatic Text Decomposition and Structuring , 1994, Inf. Process. Manag..

[12]  Freddy Y. Y. Choi Advances in domain independent linear text segmentation , 2000, ANLP.

[13]  Mark A. O'Neill,et al.  Practical approach to the stereo matching of urban imagery , 1992, Image Vis. Comput..

[14]  Gabriella Kazai,et al.  Tolerance to irrelevance: a user-effort oriented evaluation of retrieval systems without predefined retrieval unit , 2004 .

[15]  David I. Beaver,et al.  The Handbook of Logic and Language , 1997 .

[16]  Marti A. Hearst Multi-Paragraph Segmentation Expository Text , 1994, ACL.

[17]  Justin Zobel,et al.  Passage retrieval revisited , 1997, SIGIR '97.

[18]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[19]  Jeff Conklin,et al.  Hypertext: An Introduction and Survey , 1987, Computer.

[20]  Jan van Eijck,et al.  Representing Discourse in Context , 1997, Handbook of Logic and Language.

[21]  Christian Plaunt,et al.  Subtopic structuring for full-length document access , 1993, SIGIR.

[22]  Randall Hagner Trigg,et al.  A network-based approach to text handling for the on-line scientific community , 1983 .

[23]  Mitchell P. Marcus,et al.  Topic segmentation: algorithms and applications , 1998 .

[24]  James Allan,et al.  Introduction to the Special Issue on Methods and Tools for the Automatic Construction of Hypertext , 1997, Inf. Process. Manag..

[25]  Alan F. Smeaton,et al.  Segmenting broadcast news streams using lexical chains , 2002 .

[26]  Tom Carey,et al.  Labeled, typed links as cues when reading hypertext documents , 1996 .

[27]  James P. Callan,et al.  Passage-level evidence in document retrieval , 1994, SIGIR '94.

[28]  David J. Harper,et al.  A language modelling approach to relevance profiling for document browsing , 2002, JCDL '02.

[29]  E. F. Skorochod'ko Adaptive Method of Automatic Abstracting and Indexing , 1971, IFIP Congress.

[30]  Marti A. Hearst TileBars: visualization of term distribution information in full text information access , 1995, CHI '95.

[31]  James Allan Building Hypertext Using Information Retrieval , 1997, Inf. Process. Manag..

[32]  Steven J. DeRose,et al.  Expanding the notion of links , 1989, Hypertext.

[33]  Carol Tenopir,et al.  Reading behaviour and electronic journals , 2002, Learn. Publ..

[34]  James Allan,et al.  Approaches to passage retrieval in full text information systems , 1993, SIGIR.

[35]  Johan van Benthem,et al.  Handbook of Logic and Language , 1996 .

[36]  Michael Lesk Understanding Digital Libraries, Second Edition (The Morgan Kaufmann Series in Multimedia and Information Systems) , 2004 .