Vocabulary Mapping in the NASA ADS: Prospects for Practical Subject Access

The popular NASA Astrophysics Data System includes bibliographic records indexed with terms from a variety of semi-compatible descriptor languages. These include coordinate index terms taken from the NASA Thesaurus and Astrophysical Journal subject headings, among others. We have worked to develop a system that takes as input the NASA terms assigned by professional indexers, and translates them into ApJ headings. Our system maps sets of descriptors, rather than individual descriptors, since two or more coordinate index terms may translate to a single pre-coordinated subject heading. We began our study with lexical resemblance as the main source of evidence and later developed a connected system that exploits patterns of consistent co-assignment in a subset of the ADS collection that is indexed using both ApJ headings and NASA terms. Our most recent efforts have been aimed at improving the network’s performance via supervised learning. In this paper we present the results of our most recent formal evaluation studies and an examination of some specific documents drawn from a set we’ve mapped using the network.

1. The Heterogeneous Indexing Problem

In an ongoing project at the University of Illinois, we have investigated methods to support the automatic and/or computer-assisted reconciliation of heterogeneous indexing in the NASA Astrophysics Data System (ADS). ADS provides astronomers worldwide with access to over a million abstracts and full text articles in the fields of astronomy and astrophysics, instrumentation, physics and geophysics (Eichhorn et al. 1998). A mixture of controlled indexing vocabularies has limited ADS searchers’ ability to conduct precise subject searches, and our investigations have focused on two sources of evidence for resolving the inconsistencies: lexical resemblance between descriptors and consistent assignment of descriptors from different vocabularies to the same documents (Dubin 1998; Lee 1998; Lee, Dubin, & Kurtz 1999).

1.1. Vocabulary Reconciliation

Indexing a document is a highly demanding task, and it is hard to elicit explicitly the set of formal rules for indexing. Accordingly, it is not feasible to