Fishing for Translation Equivalents Using Grammatical Anchors

Bilingual parallel corpora offer a treasure house of human translator’s knowledge of the correspondences between the two languages. Extracting by automatic means the translation equivalents deemed accurate and contextually appropriate by a human translator is of great practical importance for various fields such as example-based machine translation, computational lexicography, information retrieval, etc. The task of word or phrase level identification is greatly reduced if suitable anchor points can be found in the stream of texts. It is suggested that grammatical morphemes provide very useful clues to finding translation equivalents. They typically form a closed set, occur frequently enough in sentences, have more or less fixed meanings, and, most important, will stand in a one-to-one or at most one-to-few relationship with corresponding elements in the other language. This paper will explore the viability of the idea with reference to the Hungarian and English versions of Plato’s Republic, which are available in sentence-aligned form. Hungarian has a rich set of suffixes which are typically deployed in a concatenated manner. Corresponding to them in English are prepositions, auxiliary words, and suffixes. The paper will show how, by starting from a well defined set of correspondences between Hungarian grammatical morphemes and their equivalents and using a combination of pattern matching and heuristics, one can arrive at a mapping of phrases between the two texts.