The analysis of noun sequences using semantic information extracted from on-line dictionaries

This dissertation describes a computational system for the automatic analysis of noun sequences in unrestricted text. Noun sequences (also known as noun compounds or complex nominals) have several characteristics that are obstacles to their automatic interpretation. First, the creation of noun sequences is highly productive in English; it is not possible to store all the noun sequences that will be encountered while processing text. Second, their interpretation is not recoverable from syntactic or morphological analysis. Interpreting a noun sequence, i.e., finding the relation between the nouns in a noun sequence, requires semantic information, both in limited domains and in unrestricted text. The semantic analysis in previous computational systems relied heavily on the availability of domain-specific knowledge bases; these have always been hand-coded. In this dissertation, we describe a new approach to the problem of interpreting noun sequences; we also propose a new classification schema for noun sequences, consisting of 14 basic relations/classes. The approach involves a small set of general rules for interpreting noun sequences; these rules make use of semantic information extracted from the definitions in on-line dictionaries, and the process for automatically acquiring this semantic information is described in detail. Each general rule can be considered a configuration of semantic features and attributes on the nouns which provides evidence for a particular noun sequence interpretation; the rules access a set of 28 semantic features and attributes. The rules test relatedness between the semantic information and the nouns in the noun sequence. The score for each rule is determined not by the presence or absence of semantic features and attributes, but by the degree to which the nouns are related. The results show that this system interprets 53% of the noun sequences in previously unseen text.
An analysis of the results indicates that additional rules are needed and that the semantic information found provides good results, but some semantic information is still missing. For these tests, only information extracted from the definitions was used; on-line dictionaries also contain example sentences, which should be exploited, as well as the definitions of words other than those in the noun sequence.
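
The idea of scoring a rule by degree of relatedness, rather than by the mere presence of a feature, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the attribute names (HYPERNYM, PURPOSE_OBJECT, MADE_OF, LOCATED_AT), the relation labels, and the score values 1.0 and 0.5 are hypothetical and are not the 28 features, the 14 classes, or the scoring scheme of the system described above.

```python
from typing import Dict, List, Set, Tuple

Features = Dict[str, Set[str]]

# Hypothetical lexicon: the kind of attribute/value information that might be
# extracted from dictionary definitions. Entries are invented for illustration.
LEXICON: Dict[str, Features] = {
    "bird": {"HYPERNYM": {"animal"}},
    "sanctuary": {"HYPERNYM": {"place"}, "PURPOSE_OBJECT": {"animal"}},
    "stone": {"HYPERNYM": {"material"}},
    "wall": {"HYPERNYM": {"structure"}, "MADE_OF": {"stone", "brick"}},
}

# Hypothetical rules: (relation label, attribute to inspect on the head noun).
RULES: List[Tuple[str, str]] = [
    ("purpose", "PURPOSE_OBJECT"),
    ("material", "MADE_OF"),
    ("location", "LOCATED_AT"),
]


def relatedness(noun: str, values: Set[str]) -> float:
    """Return an assumed degree of relatedness between a noun and attribute values.

    1.0 if the noun itself is one of the values, 0.5 if one of its hypernyms
    is, and 0.0 otherwise.
    """
    if noun in values:
        return 1.0
    hypernyms = LEXICON.get(noun, {}).get("HYPERNYM", set())
    if hypernyms & values:
        return 0.5
    return 0.0


def interpret(modifier: str, head: str) -> List[Tuple[str, float]]:
    """Score every rule for a two-noun sequence; best-scoring relation first."""
    head_info = LEXICON.get(head, {})
    scored = []
    for relation, attribute in RULES:
        score = relatedness(modifier, head_info.get(attribute, set()))
        if score > 0:
            scored.append((relation, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(interpret("stone", "wall"))      # direct match on MADE_OF
    print(interpret("bird", "sanctuary"))  # match through the hypernym "animal"
```

In this sketch "bird sanctuary" still receives a purpose reading even though "bird" never appears in the entry for "sanctuary", because the match through a hypernym earns a lower, non-zero score; a noun sequence for which no rule scores above zero is left uninterpreted.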