Experiments with a noun-phrase driven statistical machine translation system

This paper presents a noun-phrase driven two-level statistical machine translation (SMT) system. Noun phrases (NPs) serve as the unit of decomposition for building a two-level hierarchy of phrases. English noun phrases are identified using a parser, and their translations are induced using a statistical word alignment model. The identified noun phrase pairs in the training corpus are replaced with a tag to produce an NP-tagged corpus, which is then used to extract phrase translation pairs. Both NP translations and NP-tagged phrases are used in a two-level translation decoder: in the first level, NP translations are used to tag the NPs; in the second level, NP-tagged phrases match across NPs to produce translations. The two-level system shows significant improvements over a baseline SMT system. It also produces longer matching phrases because of the generalization introduced by tagging NPs.
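
The pipeline described above can be illustrated with a minimal sketch in Python. It is not the system's implementation: the tag symbol `<NP>`, the function names, the span-projection rule (take the smallest target span covering the aligned words and reject it if it is aligned to words outside the NP), and the exact-match table lookup standing in for a scored decoder are all assumptions made for illustration. The English-Spanish sentence, its alignment, and the two tables are toy data, and NP slots are filled in source order, which a real system would not assume.

```python
from typing import Dict, List, Optional, Set, Tuple

NP_TAG = "<NP>"  # hypothetical placeholder symbol for a replaced noun phrase
Span = Tuple[int, int]  # half-open token span [start, end)


def project_span(src_span: Span, alignment: Set[Tuple[int, int]]) -> Optional[Span]:
    """Induce the target span of a source NP from word alignment links (i, j)."""
    start, end = src_span
    targets = [j for i, j in alignment if start <= i < end]
    if not targets:
        return None
    lo, hi = min(targets), max(targets) + 1
    # Reject the span if any target word inside it links to a source word outside the NP.
    for i, j in alignment:
        if lo <= j < hi and not (start <= i < end):
            return None
    return lo, hi


def replace_spans(tokens: List[str], spans: List[Span]) -> List[str]:
    """Replace each (non-overlapping) span with a single NP tag."""
    out: List[str] = []
    pos = 0
    for start, end in sorted(spans):
        out.extend(tokens[pos:start])
        out.append(NP_TAG)
        pos = end
    out.extend(tokens[pos:])
    return out


def tag_pair(src: List[str], tgt: List[str], np_spans: List[Span],
             alignment: Set[Tuple[int, int]]):
    """Return the NP-tagged sentence pair and the extracted NP translation pairs."""
    kept_src, kept_tgt, np_pairs = [], [], []
    for span in np_spans:
        proj = project_span(span, alignment)
        if proj is None:
            continue  # leave NPs without a consistent translation untagged
        kept_src.append(span)
        kept_tgt.append(proj)
        np_pairs.append((" ".join(src[span[0]:span[1]]),
                         " ".join(tgt[proj[0]:proj[1]])))
    return replace_spans(src, kept_src), replace_spans(tgt, kept_tgt), np_pairs


def translate_two_level(src: List[str], np_spans: List[Span],
                        np_table: Dict[str, str],
                        phrase_table: Dict[str, str]) -> str:
    """Level 1 translates and tags NPs; level 2 matches the NP-tagged phrase."""
    ordered = sorted(np_spans)
    np_out = iter(np_table[" ".join(src[s:e])] for s, e in ordered)
    tagged = " ".join(replace_spans(src, ordered))
    template = phrase_table[tagged].split()
    # Simplification: NP slots are filled in source order (no reordering of NPs).
    return " ".join(next(np_out) if tok == NP_TAG else tok for tok in template)


# Training side: tag one toy sentence pair.
src = "the red house is big".split()
tgt = "la casa roja es grande".split()
links = {(0, 0), (1, 2), (2, 1), (3, 3), (4, 4)}
tagged_src, tagged_tgt, np_pairs = tag_pair(src, tgt, [(0, 3)], links)

# Decoding side: the tagged phrase also covers a sentence with a different NP.
np_table = {"the old man": "el hombre viejo"}
phrase_table = {" ".join(tagged_src): " ".join(tagged_tgt)}
print(translate_two_level("the old man is big".split(), [(0, 3)], np_table, phrase_table))
```

In this toy case the tagged phrase "<NP> is big" matches a test sentence whose NP never co-occurred with "is big" in training, which is the kind of generalization the abstract credits for longer matching phrases.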