Building lexicons out of a database for idioms

In this paper, we present a computational dictionary for German verbal idioms, called Phraseo-Lex, in which idioms are classified according to a wide range of description criteria. It was designed as a source of idiomatic knowledge for both the human user and applications in natural language processing. We show how it can be used for the latter, namely by generating detailed idiomatic entries for a given natural language processing lexicon. We introduce the implementation of a mapping between part of the Phraseo-Lex dictionary entry and one particular NLP lexicon. The goal system used consists of a chart parser, a syntactic formalism similar to PATR-II, and a separate semantic component using a combination of Discourse Representation Theory and λ-calculus.

1. The Phraseo-Lex Dictionary

Idioms, especially verbal idioms, are the most complex multi-word units found in natural language, and a wide range of syntactic, semantic, and pragmatic properties typical of them have been described by different researchers. In this paper we present the Phraseo-Lex system (Keil, 1997), a database for the representation of German verbal idioms. By restricting ourselves to the relatively specialized phenomenon of verbal idioms, we gain the advantage that we can incorporate a large number of features in our dictionary. Phraseo-Lex was designed for two different purposes: on the one hand as a computational dictionary for humans, and on the other hand as a source and an instrument for generating lexicons for natural language processing systems. This approach led to a clear separation between the different classification criteria and features, and to the design of a library of interface functions that lets other programs access the different kinds of linguistic knowledge in an easy and unambiguous way.
An advantage of this approach is that linguists and builders of natural language processing systems use the same system for different kinds of work, thereby detecting different faults and improving it in different ways. For the traditional linguist, we offer a graphical user interface that enables them to add idiomatic knowledge in an intuitive way. An example of this is an idiom’s syntactic structure and the categories of the participating lexemes, both of which are described together by graphically building a phrase structure tree. The linguistic content of the tree is processed by the system and can be retrieved for each of the idiom parts separately, using query functions from the interface mentioned above. This is especially useful for the generation of lexicons in a grammar formalism where an idiom is not represented as one single unit, but separated into its constituents.

For natural language processing, the Phraseo-Lex dictionary may be useful in two different ways. Firstly, the systematic collection of idiomatic knowledge it offers can be used when designing a formal representation of idioms, as it helps to ensure that this representation accounts for all relevant phenomena. Secondly, after such a representation has been planned, the contents of the database can be converted into it automatically, thus adding idiomatic knowledge to a given NLP lexicon. To generate idiomatic lexical entries in this way, an additional lexicon-building program is needed. It works by mapping the information it obtains through the Phraseo-Lex interface functions to the representation needed by the goal system. The usefulness of a computational dictionary for this purpose is mainly due to the fact that verbal idioms are quite diverse, and many of their characteristics still seem to be of a rather arbitrary nature.
Even Schenk (1995), who tries to come up with a theory that predicts the syntactic behaviour of idioms, admits in the end that “not all operations on idioms have been accounted for. For example, restrictions on passivization or imperative formation in the case of phrasal idioms do not follow from this analysis” (Schenk, 1995). Therefore, when representing idioms in a formal grammar, it is useful to have a detailed linguistic database at one’s disposal, where the knowledge needed has already been entered for the individual idiom.

In the remainder of this paper, section 2 gives an overview of some of the idiom characteristics captured in the Phraseo-Lex dictionary, restricted to the features most interesting for natural language processing. Section 3 introduces the goal system, which uses Discourse Representation Theory as its semantic framework, and an adequate way to represent idioms in it. Section 4 then describes an exemplary mapping algorithm that converts Phraseo-Lex dictionary entries into lexicon entries for this system.

2. Syntax and Semantics of Verbal Idioms

The Phraseo-Lex computational dictionary contains an extensive description of German verbal idioms with respect to syntactic, semantic, and pragmatic criteria, as well as a graphical user interface and a search component that allows the human user to retrieve all dictionary entries with any given set of characteristics from the database. A complete description of the system from the human user’s point of view can be found in Keil (1997). In the following overview, we restrict ourselves to those syntactic and semantic criteria we believe to be relevant for the representation of idioms in natural language processing.

2.1. Lemma and Base Lexemes

In Phraseo-Lex, as in conventional dictionaries, a dictionary entry is headed by its lemma.
It is represented in a citation form similar to the traditional one for verbal idioms, but extended by information about the subject position. Examples (1) and (2) show lemmata for an idiom with a variable subject, and one where the subject is a fixed part of the idiom.

(1) (jmd.) einen Bock schießen
    (sb.) a buck shoot
    to make a mistake

(2) der Kopf raucht jdm.
    the head smokes sb.
    sb.’s head is spinning

Additionally, an idiom is indexed with a list of its content words, called the idiom’s base lexemes. The base lexemes are given in their traditional citation form, which often differs from the inflected form in the idiom’s lemma. When turning a Phraseo-Lex entry into a lexicon entry in a given NLP lexicon, they can be used to identify the words the idiom consists of.

2.2. Syntactic Features

We call the classification into idioms with a fixed subject and those with a variable one the idiom’s syntactic type. The remaining syntactic structure is described by means of a phrase structure tree. The phrase structure grammar needed to construct such a tree is implicitly given by an interactive tree building facility, which is part of the Phraseo-Lex graphical user interface. It contains a display of the tree and a column of syntactic category buttons that can be used to add nodes to the tree.

The phrase structure tree serves to encode the verb’s dependents, their case, the words they contain and their syntactic categories. Just like simple verbs, verbal idioms require certain dependents to appear with them in the sentence. These are called the idiom’s external valencies. Additionally, the idiom’s internal structure can also be described in terms of valency theory, namely as the verb’s complements or internal valencies. The idiom in example (1) has one internal valency, einen Bock (a buck), and one external valency, a noun phrase in the nominative case.
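The description of example (1) can be made concrete with a minimal sketch in Python. It is not the Phraseo-Lex implementation: the names `Node`, `IdiomEntry` and `valencies`, the category labels, and the base lexeme list are hypothetical and chosen for illustration only. The sketch assumes that a dependent of the verb counts as an internal valency when it contains fixed idiom words, and that a variable subject contributes an external nominative noun phrase.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a phrase structure tree (hypothetical representation)."""
    category: str                      # e.g. "VP", "NP", "DET", "N", "V"
    case: Optional[str] = None         # e.g. "nom", "acc"; None if not applicable
    word: Optional[str] = None         # set on leaves that are fixed idiom words
    children: List["Node"] = field(default_factory=list)

    def words(self) -> List[str]:
        """Collect the fixed idiom words below this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.words()]


@dataclass
class IdiomEntry:
    lemma: str                 # citation form, including the subject position
    base_lexemes: List[str]    # content words in citation form
    fixed_subject: bool        # the idiom's syntactic type
    tree: Node                 # phrase structure tree rooted at the verb phrase


Valency = Tuple[str, Optional[str], List[str]]  # (category, case, fixed words)


def valencies(entry: IdiomEntry) -> Tuple[List[Valency], List[Valency]]:
    """Split the verb's dependents into internal and external valencies."""
    internal: List[Valency] = []
    external: List[Valency] = []
    for dep in entry.tree.children:
        if dep.category == "V":        # the verb itself is not a dependent
            continue
        fixed = dep.words()
        target = internal if fixed else external
        target.append((dep.category, dep.case, fixed))
    if not entry.fixed_subject:
        # Assumption: a variable subject is an open nominative noun phrase.
        external.insert(0, ("NP", "nom", []))
    return internal, external


# Example (1): (jmd.) einen Bock schießen
tree = Node("VP", children=[
    Node("NP", case="acc", children=[
        Node("DET", word="einen"),
        Node("N", word="Bock"),
    ]),
    Node("V", word="schießen"),
])
entry = IdiomEntry("(jmd.) einen Bock schießen", ["Bock", "schießen"], False, tree)
internal, external = valencies(entry)
```

Under these assumptions, `internal` holds the accusative noun phrase einen Bock and `external` holds a single open nominative noun phrase, matching the valency description given for example (1).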
The valency structure of an idiom can be computed from the information encoded in the phrase structure tree, with the subject position being taken from the idiom’s syntactic type.

A typical characteristic of idioms is their resistance to syntactic manipulations: many idioms cannot undergo certain transformations without losing their idiomatic reading; furthermore, syntactic anomalies and unique components may occur. In Phraseo-Lex, a given set of syntactic transformations can be marked as possible or impossible for each idiom, for example passivization, relativization, negation, wh-question, adnominal modification, and quantification. Transformations that apply to the idiom as a whole can be marked as possible, impossible or undecidable; for those that apply to a constituent of the idiom, typically to a noun phrase, a list of base lexemes that allow the transformation can be given.

A syntactic anomaly is a construction that is considered syntactically incorrect in nonidiomatic language, but nevertheless occurs in some idioms, for example a missing determiner, or a deviation in verb valency. Phraseo-Lex provides a list of syntactic anomalies found in German idioms, from which those applying to the current idiom, if any, can be selected.

A unique component is a word that does not exist outside the idiom it occurs in. In Phraseo-Lex, each idiom’s unique component(s) are listed explicitly.

2.3. Semantic Features

The classical view of the semantics of idioms is that they do not have an internal semantic structure, i.e. they are semantically noncompositional. This was even taken as one of the defining criteria for idioms. Čermák supports this position when calling the idiom’s noncompositionality “a conditio sine qua non for its semantic substance” and claiming that “semantically, the idiom is a holistic, Gestalt phenomenon, a feature often acknowledged, which excludes any possibility of an objective semantic analysis” (Čermák, 1998).
This traditional view was challenged by Wasow et al., who claimed that there exists a class of idioms for which parts of the idiom “have identifiable meanings which combine to produce the meaning of the whole” (Wasow et al., 1983), i.e. a class of compositional idioms. A more recent view recognizes a continuum between fixed idiomatic expressions on the one hand and freely combinable words on the other hand, with different degrees of both syntactic flexibility and semantic analyzability in between (Abeillé, 1995; Dobrovol’skij, 1995). Taking this into account, we decide for each idiom part separately whether we take it to hav