Ranking documents with a thesaurus

This article reports on exploratory experiments in evaluating and improving a thesaurus through studying its effect on retrieval. A formula called DISTANCE was developed to measure the conceptual distance between queries and documents encoded as sets of thesaurus terms. DISTANCE references MeSH (Medical Subject Headings) and assesses the degree of match between a MeSH-encoded query and document. The performance of DISTANCE on MeSH is compared to the performance of people in the assessment of conceptual distance between queries and documents, and is found to simulate with surprising accuracy the human performance. The power of the computer simulation stems both from the tendency of people to rely heavily on broader-than (BT) relations in making decisions about conceptual distance and from the thousands of accurate BT relations in MeSH. One source for discrepancy between the algorithms' measurement of closeness between query and document and people's measurement of closeness between query and document is occasional inconsistency in the BT relations. Our experiments with adding non-BT relations to MeSH showed how these non-BT non-BT relations to MeSH showed how these non-BT relations could improve document ranking, if DISTANCE were also appropriately revised to treat these relations differently from BT relations.