Contextualizing Retrieval of Full-Length Documents

We address some issues relating to retrieval from unfamiliar text collections consisting of full-length documents. We claim that displaying query results in terms of inter-document similarity is inappropriate with long texts, and suggest instead that the results of simple initial queries should be contextualized according to category sets that correspond to the main topics of the texts. We argue that main topics of long texts should be represented by multiple categories, since in most cases one category cannot adequately classify a text. We describe a new automatic categorization algorithm that does not require pre-labeled texts and a prototype browsing interface that presents a simple mechanism for displaying multi-dimensional information.
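To make the idea of contextualizing results by category sets concrete, the following is a minimal sketch, not the categorization algorithm described in the paper. It assumes each retrieved document has already been assigned several main-topic categories by some earlier step, and it groups the results of a query by the set of categories they share rather than ranking them by similarity. The function name `group_by_category_set`, the document ids, and the category labels are all hypothetical.

```python
from collections import defaultdict


def group_by_category_set(results, doc_categories):
    """Group retrieved document ids by the set of categories assigned to each.

    results: list of document ids returned by an initial query (assumed input).
    doc_categories: mapping from document id to its main-topic categories
        (assumed to come from a separate categorization step).
    """
    groups = defaultdict(list)
    for doc_id in results:
        # A frozenset makes the category combination order-independent
        # and usable as a dictionary key.
        key = frozenset(doc_categories.get(doc_id, ()))
        groups[key].append(doc_id)
    return dict(groups)


# Hypothetical query results and category assignments, for illustration only.
results = ["d1", "d2", "d3", "d4"]
doc_categories = {
    "d1": ["finance", "law"],
    "d2": ["law", "finance"],
    "d3": ["finance", "technology"],
    "d4": ["medicine"],
}

groups = group_by_category_set(results, doc_categories)
for categories, docs in sorted(groups.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(categories), docs)
```

Here documents d1 and d2 fall into the same group because they share the same pair of categories, while d3 shares only one category with them and is shown separately. A single-category scheme would have to place all three under one label or split them arbitrarily.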
