In this paper, we deal with issues that face an interlingua-based, reversible machine translation system when the literal meaning of the source text is not identical to the literal meaning of the natural target translation. We present an algorithm for lexical choice that handles such cases and that relies exclusively on reversible, monolingual linguistic descriptions and a language-independent domain knowledge base.

1 Introduction

Machine translation is an obvious application for reversible natural language systems, since both understanding and generation are important parts of the process. There are several arguments for this view (for example, [Isabelle, 89]), including reducing the total cost of adding a new language and making it easier to maintain and validate the resulting system.

Reversible MT systems, just like the broader class of MT systems as a whole, fall into two roughly defined families: transfer systems and interlingua (or pivot) systems. Reversible transfer systems (e.g., [van Noord, 90], [Zajac, 90], [Dymetman, 88], and [Strzalkowski, 90]) exploit three reversible subsystems: one to analyze the source text, one to perform the transfer, and a third to generate the target text. Interlingua-based systems (e.g., Ultra [Farwell, 90]), on the other hand, require only two reversible components: one to analyze the source text into the interlingua representation, and one to generate the target text from that representation. In this paper, we will focus on issues that arise in the design of interlingua-based MT systems.

The simplest model of a reversible, interlingua-based system contains two components: one analyzes the source text to create the interlingua representation and the other maps from that to the target text. Unfortunately, the real situation is not that simple, for several reasons, including two that we will focus on here:

• This model assumes that the same information is present in the target text as in the source.
But in some cases, which have been called translation mismatches [Kameyama, 91], information is either added to or deleted from the source in creating the target. We will show some examples of this below in Section 2. In these cases, the simple reversible system we outlined above would produce unacceptable translations.

• Although the notion of a reversible system that describes the set of legal translations is reasonably clear-cut, the notion of preferred translation is more difficult to define [van Noord, 90], [Barnett, 91d]. In some cases, which have been called translation divergences [Dorr, 90], the most natural translation differs from the source in some significant way (e.g., its focus).

Of course, in many cases, both of these issues occur together and interact. In this paper, we present some techniques for dealing with these problems. These techniques have three important properties:

• They require purely declarative, reversible descriptions of the languages that are involved.
• They require only monolingual facts. Thus new languages can be added to the system without any changes to the descriptions of any other languages.
• They are stated in a way that enables their performance to increase gradually along with the power of the underlying knowledge base.

2 Translation Divergences and Mismatches

In this section, we examine some examples in which the source and target languages do not line up. Then, in the rest of the paper, we will outline our solution to these problems.

1. English: "The dogs were running down the street."
   Japanese: "inu ga toori-o hashitte-ita." (lit. "dog run (along) the street.")

In English, noun phrases must be marked for number. In the natural Japanese translation, number information is absent.

2. English: "I saw a fish in the water."
   Spanish: "Vi un pez en el agua."
   English: "I ate a fish."
   Spanish: "Comí un pescado."
Spanish makes a distinction between a fish in its natural state ("pez") and a fish that has been caught for food ("pescado"). "Pez" is also the default form in case it is not clear or does not matter what state the fish is in. But it cannot be used if it is clear that the fish has been caught. To get the translation right, it is necessary to infer extra information about the fish, using other knowledge that is available either from the rest of the sentence or from the larger discourse context. Similarly, to reverse the process and go from Spanish to English, it is necessary, in the case of "pescado", to throw away information lest we produce the unnatural translation, "I ate a caught fish." It is important to note, though, that this information cannot be thrown away during understanding, since it would be important if we were translating into another language that made the same distinction. It must be preserved until the point at which generation into the target language takes place.

3. English: "I know him."
   Spanish: "Lo conozco."
   English: "I know the answer."
   Spanish: "Sé la respuesta."

Here the issue is the correct translation between English "know" and the two Spanish verbs "conocer" (to be acquainted with someone) and "saber" (to know a fact). This example is similar to the previous one except that here there is no default form. Spanish does not have a word that covers these two different events.

4. English: "I have a headache."
   Japanese: "Atama ga itai." (literally, "my head hurts")

Here the problem is more difficult. No longer is it an issue of a single lexical item for which there is not an exact match in the target language. Instead, the texts in the two languages differ at the level of an entire phrase, with each language choosing a phrase that describes the situation from a different point of view. In English, we seem to describe an object, "a headache", while Japanese describes the state of a head hurting.
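The lexical behavior described in Examples 2 and 3 can be illustrated with a minimal Python sketch. It is not the algorithm presented in this paper; it only shows, under simplifying assumptions, how a monolingual lexicon might pick the most specific word whose conditions are known to hold, fall back on a default form when one exists, and report failure when none does. The names `LexicalEntry`, `choose_word`, the concept labels `Fish` and `Know`, and the fact labels `caught`, `object-is-person`, and `object-is-fact` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """One word of a monolingual lexicon (hypothetical structure)."""
    word: str
    concept: str                        # language-independent concept
    requires: frozenset = frozenset()   # extra facts that must be known to hold


# Assumed toy lexicons. An entry with no requirements acts as the default form.
SPANISH = [
    LexicalEntry("pez", "Fish"),
    LexicalEntry("pescado", "Fish", frozenset({"caught"})),
    LexicalEntry("conocer", "Know", frozenset({"object-is-person"})),
    LexicalEntry("saber", "Know", frozenset({"object-is-fact"})),
]
ENGLISH = [
    LexicalEntry("fish", "Fish"),
    LexicalEntry("know", "Know"),
]


def choose_word(lexicon, concept, facts):
    """Return the most specific word whose requirements are all in `facts`.

    Returns None when no entry applies, which signals that more
    information has to be inferred before a word can be chosen.
    """
    usable = [e for e in lexicon
              if e.concept == concept and e.requires <= facts]
    if not usable:
        return None
    return max(usable, key=lambda e: len(e.requires)).word


print(choose_word(SPANISH, "Fish", {"caught"}))  # pescado
print(choose_word(SPANISH, "Fish", set()))       # pez (default form)
print(choose_word(ENGLISH, "Fish", {"caught"}))  # fish ("caught" is dropped)
print(choose_word(SPANISH, "Know", set()))       # None (no default form)
```

Because the fact `caught` is kept in the representation and only ignored by the English lexicon, the same representation could still be used to generate into a language that does make the distinction.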
The examples that we have just discussed illustrate three different categories of semantic differences between languages:

• Mismatches caused by semantically significant differences in morphology and syntax, e.g., Example 1. Other common examples involve the presence or absence of markings for gender, number, tense, aspect, and level of politeness.
• Mismatches caused by lexical differences, where one language has a word that the other lacks, e.g., Examples 2 and 3.
• Divergences, in which the two languages describe the same state of the world in different ways, as in Example 4. In some of these cases, identical information is conveyed (in the sense that the semantic interpretation of the source implies that of the target and vice versa), but in some cases (and depending on the particular model of the world that is being used to define implication) the semantic content of the two forms will not be identical, so many cases of divergence also contain mismatches.

Mismatches and divergences are typically viewed as translation (transfer) problems. But in an interlingua-based system it becomes clear that they are primarily problems for generation. The source language analyzer produces an interlingua representation, which the target generator must render into the target language. In cases of mismatch or divergence, doing this requires manipulating the interlingua expression itself since it does not already correspond exactly to the structure of the target string that should be produced. But actually, the fact that the expressions in the interlingua representation came from linguistic expressions in a source language as opposed to from some other source (for example, the output of a problem-solving system) is irrelevant except for a few special cases in which the form of the source language expressions can provide help in making generation decisions.
So, in the rest of this paper, we will present a generation-centered treatment of mismatches that relies entirely on reversible, monolingual descriptions of the two languages.

3 The KBNL MT System

Figure 1 shows a schematic description of the MT system that we are building. All of the representations in the figure, except the source and target language strings, are described in terms that are drawn from a knowledge base (KB) that describes the domain(s) of discourse. In addition to providing a common set of terms that enable meanings to be defined, this backend knowledge base is important because it provides the ability to reason about meanings and thus the ability to add to the target text information that was omitted from the source. We will assume that all the KB-based representations can be treated as sets of logical assertions (although they can of course be implemented in a variety of ways, including the frame-based system [Crawford, 90] that we are using).

[Figure 1: schematic of the MT system. Recoverable labels: SOURCE LANGUAGE STRING, understand]
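The view of KB-based representations as sets of logical assertions, together with the KB's ability to supply information that the source omitted, can be illustrated with a small Python sketch. This is only an assumption-laden illustration and not the frame-based system used in KBNL: assertions are modeled as plain tuples, and the single rule `eaten_fish_is_caught` (that a fish being eaten has been caught) is a hypothetical piece of domain knowledge, as are the predicate names `Fish`, `eat`, `see`, and `caught`.

```python
def eaten_fish_is_caught(facts):
    """Hypothetical domain rule: anything that is a Fish and is eaten was caught."""
    for fact in facts:
        if fact[0] == "eat":
            eaten = fact[2]
            if ("Fish", eaten) in facts:
                yield ("caught", eaten)


def infer(assertions, rules):
    """Forward-chain over a set of assertions until no new facts appear."""
    facts = set(assertions)
    while True:
        # Collect everything the rules derive before touching `facts`.
        derived = {f for rule in rules for f in rule(facts)} - facts
        if not derived:
            return facts
        facts |= derived


# Assumed interlingua for "I ate a fish": nothing says the fish was caught.
ate = {("Fish", "f1"), ("eat", "speaker", "f1")}
# Assumed interlingua for "I saw a fish in the water".
saw = {("Fish", "f2"), ("see", "speaker", "f2")}

print(("caught", "f1") in infer(ate, [eaten_fish_is_caught]))  # True
print(("caught", "f2") in infer(saw, [eaten_fish_is_caught]))  # False
```

Under these assumptions the enriched assertion set for "I ate a fish" contains the extra fact needed to prefer "pescado" over "pez" in Spanish, while the assertions for "I saw a fish in the water" are left unchanged.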
[1] Inderjeet Mani, et al. Capturing Language-Specific Semantic Distinctions in Interlingua-Based MT. 1991.
[2] Gertjan van Noord, et al. Semantic-Head-Driven Generation. Comput. Linguistics, 1990.
[3] Inderjeet Mani, et al. Using Bidirectional Semantic Rules for Generation. INLG, 1990.
[4] Towards reversible MT systems. MTSUMMIT, 1989.
[5] Irene Heim, et al. The semantics of definite and indefinite noun phrases: a dissertation. 1982.
[6] Bonnie J. Dorr, et al. Solving Thematic Divergences in Machine Translation. ACL, 1990.
[7] Rémi Zajac. A relational approach to translation. 1990.
[8] Uchida Hiroshi, et al. Interlingua for Multilingual Machine Translation. 1993.
[9] R. Jakobson, et al. Zur Struktur des russischen Verbums. 1971.
[10] Inderjeet Mani, et al. Shared Preferences. 1991.
[11] Gertjan van Noord, et al. Reversible Unification Based Machine Translation. COLING, 1990.
[12] Stanley Peters, et al. Resolving Translation Mismatches With Information Flow. ACL, 1991.
[13] James Melton Crawford Jr., et al. Access-limited logic: a language for knowledge representation. 1991.
[14] Henk Zeevat, et al. An Algorithm for Generation in Unification Categorial Grammar. EACL, 1989.