Range Concatenation Grammars for Translation

Positive and bottom-up non-erasing binary range concatenation grammars (Boullier, 1998) with at most binary predicates ((2,2)-BRCGs) is a O(|G|n6) time strict extension of inversion transduction grammars (Wu, 1997) (ITGs). It is shown that (2,2)-BRCGs induce inside-out alignments (Wu, 1997) and cross-serial discontinuous translation units (CDTUs); both phenomena can be shown to occur frequently in many hand-aligned parallel corpora. A CYK-style parsing algorithm is introduced, and induction from aligment structures is briefly discussed. Range concatenation grammars (RCG) (Boullier, 1998) mainly attracted attention in the formal language community, since they recognize exactly the polynomial time recognizable languages, but recently they have been argued to be useful for data-driven parsing too (Maier and Sogaard, 2008). Bertsch and Nederhof (2001) present the only work to our knowledge on using RCGs for translation. Both Bertsch and Nederhof (2001) and Maier and Sogaard (2008), however, only make use of so-called simple RCGs, known to be equivalent to linear context-free rewrite systems (LCFRSs) (Weir, 1988; Boullier, 1998). Our strict extension of ITGs, on the other hand, makes use of the ability to copy substrings in RCG derivations; one of the things that makes RCGs strictly more expressive than LCFRSs. Copying enables us to recognize the intersection of any two translations that we can recognize and induce the union c © 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved. of any two alignment structures that we can induce. Our extension of ITGs in fact introduces two things: (i) A clause may introduce any number of terminals. This enables us to induce multiword translation units. (ii) A clause may copy a substring, i.e. a clause can associate two or more nonterminals A1, . . . An with the same substring and thereby check if the substring is in the intersection of the languages of the subgrammars with start predicate names A1, . . . An. The first point is motivated by studies such as Zens and Ney (2003) and simply reflects that in order to induce multiword translation units in this kind of synchronous grammars, it is useful to be able to introduce multiple terminals simultaneously. The second point gives us a handle on context-sensitivity. It means that (2,2)-BRCGs can define translations such as {〈anbmcndm, anbmdmcn〉 | m,n ≥ 0}, i.e. a translation of cross-serial dependencies into nested ones; but it also means that (2,2)-BRCGs induce a larger class of alignment structures. In fact the set of alignment structures that can be induced is closed under union, i.e. any alignment structure can be induced. The last point is of practical interest. It is shown below that phenomena such as inside-out alignments and CDTUs, which cannot be induced by ITGs, but by (2,2)-BRCGs, occur frequently in many hand-aligned parallel corpora. 1 (2,2)-BRCGs and ITGs (2,2)-BRCGs are positive RCGs (Boullier, 1998) with binary start predicate names, i.e. ρ(S) = 2. In RCG, predicates can be negated (for complementation), and the start predicate name is typically unary. The definition is changed only for aesthetic reasons; a positive RCG with a binary start predicate name S is turned into a positive RCG with a