Automatic indexing using selective NLP and first-order thesauri

As one approach to automatic indexing, the CLARIT System uses selective natural-language processing (NLP) to identify candidate noun phrases in free text and maps them into candidate terms in a morphologically normalized form that emphasizes modifier and head relations. Candidate terms are matched against a first-order thesaurus of certified domain-specific terminology. Terms are scored and ranked based on the distribution statistics of the term (and its lexical items) in a document. Terms are also weighted according to their distribution both in a reference domain database and in a large, general corpus of English. The result is a tripartite indexing of a document by terms classified as exact (or certified), general, and novel, each ranked for relevance. In an evaluation comparing CLARIT automatic indexing of ten full-text articles in the domain of artificial intelligence with the indexing of two human subjects, CLARIT performed as well as the humans and in some respects better.
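
The matching and classification step can be illustrated with a minimal Python sketch. It is not the CLARIT implementation: the function names (`tripartite_index`, `contains_phrase`), the rule used for the "general" class (a certified term occurring as a contiguous sub-phrase of the candidate), and the use of raw in-document frequency as the ranking score are all assumptions made for illustration. The abstract does not define these details, and the domain and general-corpus weighting is left out entirely.

```python
from collections import Counter


def contains_phrase(words, phrase):
    """Return True if `phrase` (a tuple of words) occurs contiguously in `words`."""
    n = len(phrase)
    return any(tuple(words[i:i + n]) == phrase for i in range(len(words) - n + 1))


def tripartite_index(candidates, thesaurus):
    """Classify candidate terms as exact, general, or novel and rank each class.

    Assumptions (not taken from the source):
      - `candidates` are already normalized noun phrases, one per occurrence;
      - "general" means a certified term appears inside the candidate;
      - the score is the raw frequency of the candidate in the document.
    """
    counts = Counter(candidates)
    certified = {tuple(term.split()) for term in thesaurus}
    index = {"exact": [], "general": [], "novel": []}
    for term, freq in counts.items():
        words = term.split()
        if tuple(words) in certified:
            label = "exact"
        elif any(contains_phrase(words, c) for c in certified):
            label = "general"
        else:
            label = "novel"
        index[label].append((term, freq))
    # Rank each class by descending frequency, breaking ties alphabetically.
    for label in index:
        index[label].sort(key=lambda pair: (-pair[1], pair[0]))
    return index


if __name__ == "__main__":
    doc_terms = [
        "expert system",
        "expert system",
        "rule-based expert system",
        "knowledge base",
        "frame problem",
    ]
    certified_terms = {"expert system", "knowledge base"}
    for label, ranked in tripartite_index(doc_terms, certified_terms).items():
        print(label, ranked)
```

In this toy example "frame problem" falls into the novel class only because the hypothetical thesaurus omits it; a realistic scoring function would also combine the in-document counts with the domain and general-corpus statistics mentioned above.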