Roget’s Thesaurus: An additional knowledge source for Textual CBR?

Lenz, Hübner and Kunze have identified Textual CBR as a subdomain of case-based reasoning that works directly with text documents. The documents are used to construct cases in a case base that is indexed using a manually identified vocabulary. A domain-dependent similarity function is then required to recognise appropriate cases for a user’s query. This paper describes a new similarity measure for Textual CBR that can be applied to any text. The measure rests on a text representation derived from the natural coherence of written texts. The Generic Document Profile (GDP) is an attribute-value vector that uses the categories of Roget’s thesaurus as attributes, with values calculated algorithmically. The values are obtained by looking for chains of related words whose degree of association can be calculated by reference to Roget’s thesaurus as a general-purpose knowledge source. The GDP is theoretically motivated, and addresses the Ambiguity and Paraphrase problems that Lenz et al. identify as common in Textual CBR. The paper also reports on the experimental evaluation of the GDP.
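The idea of a category-based profile can be illustrated with a small sketch. It is not the paper's algorithm: the word-to-category table `TOY_THESAURUS` is a hypothetical stand-in for Roget's thesaurus, the "chain" test (a category counts only if at least two distinct words in the text share it) is a deliberately crude assumption, and cosine similarity is an assumed choice of comparison function. The names `profile` and `cosine` are likewise illustrative.

```python
import math
from collections import Counter

# Hypothetical toy stand-in for Roget's thesaurus: word -> category labels.
TOY_THESAURUS = {
    "bank": {"FINANCE", "LAND"},
    "money": {"FINANCE"},
    "loan": {"FINANCE"},
    "river": {"LAND", "WATER"},
    "shore": {"LAND", "WATER"},
    "water": {"WATER"},
}


def profile(text):
    """Build an attribute-value vector keyed by thesaurus category."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    known = [w for w in words if w in TOY_THESAURUS]

    # Count how many distinct words in the text support each category.
    support = Counter()
    for w in set(known):
        for cat in TOY_THESAURUS[w]:
            support[cat] += 1

    vector = Counter()
    for w in known:
        # Assumed "chain" criterion: keep only categories that this word
        # shares with at least one other distinct word in the same text.
        chained = {c for c in TOY_THESAURUS[w] if support[c] >= 2}
        for cat in chained:
            # Split the word's unit weight across its surviving categories.
            vector[cat] += 1.0 / len(chained)
    return dict(vector)


def cosine(p, q):
    """Cosine similarity between two sparse category vectors."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


query = "loan from the bank"
case_a = "The bank lent money"
case_b = "The river bank and the shore"

sim_a = cosine(profile(query), profile(case_a))
sim_b = cosine(profile(query), profile(case_b))
print(sim_a, sim_b)
```

In this toy setting the query and `case_a` share no content word other than "bank", yet both map to the same category because "loan" and "money" are listed under it, which mirrors the Paraphrase problem. In `case_b` the word "bank" is chained with "river" and "shore", so it contributes to the land category rather than the financial one, which mirrors the Ambiguity problem.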

[1] Roy Rada et al. Ranking documents with a thesaurus, 1989, JASIS.

[2] Kristian J. Hammond et al. Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System, 1997, AI Mag.

[3] Graeme Hirst et al. Automatically generating hypertext by computing semantic similarity, 1997.

[4] Donna K. Harman et al. Overview of the Sixth Text REtrieval Conference (TREC-6), 1997, Inf. Process. Manag.

[5] Karen Spärck Jones et al. NLP Track at TREC-5, 1996, TREC.

[6] Regina Barzilay et al. Using Lexical Chains for Text Summarization, 1997.

[7] Nancy Ide et al. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art, 1998, Comput. Linguistics.

[8] Agnar Aamodt et al. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches, 1994, AI Commun.

[9] Michael McGill et al. Introduction to Modern Information Retrieval, 1983.

[10] Susan T. Dumais et al. The vocabulary problem in human-system communication, 1987, CACM.

[11] Okumura Manabu et al. Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion, 1994, COLING.

[12] Peter C. Patton et al. Computing in the humanities, 1981.

[13] George A. Miller et al. Introduction to WordNet: An On-line Lexical Database, 1990.

[14] W. Bruce Croft et al. Lexical ambiguity and information retrieval, 1992, TOIS.

[15] I. Watson. CBR is a Methodology not a Technology, 1999.

[16] Karen Spärck Jones. What is the Role of NLP in Text Retrieval, 1999.

[17] Mario Lenz et al. Textual CBR, 1998, Case-Based Reasoning Technology.

[18] Alan F. Smeaton et al. Using NLP or NLP Resources for Information Retrieval Tasks, 1999.

[19] Alistair Moffat et al. Exploring the similarity space, 1998, SIGF.

[20] Graeme Hirst et al. Lexical Cohesion Computed by Thesaural relations as an indicator of the structure of text, 1991, CL.

[21] John Tait et al. Word Sense Disambiguation by Information Filtering and Extraction, 2000, Comput. Humanit.

[22] James F. Allen. Natural language understanding, 1987, Benjamin/Cummings series in computer science.