Paper Information - Empirical Methods for Compound Splitting

Empirical Methods for Compound Splitting

Compounded words are a challenge for NLP applications such as machine translation (MT). We introduce methods to learn splitting rules from monolingual and parallel corpora. We evaluate them against a gold standard and measure their impact on performance of statistical MT systems. Results show accuracy of 99.1% and performance gains for MT of 0.039 BLEU on a German-English noun phrase translation task.
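The abstract does not spell out how splitting rules are learned from a monolingual corpus. As an illustration only, the sketch below shows one common frequency-based heuristic: enumerate every way to cut a word into known corpus words and keep the option whose parts have the highest geometric mean of corpus counts. The names `TOY_COUNTS`, `candidate_splits`, and `best_split`, the counts themselves, and the minimum part length are all assumptions made for this example; linking elements such as the German "s" are not handled.

```python
from math import prod

# Hypothetical corpus counts, invented for illustration only.
TOY_COUNTS = {
    "haus": 1000,
    "boot": 200,
    "hausboot": 10,
    "frei": 800,
    "tag": 900,
    "freitag": 2000,
}


def candidate_splits(word, known, min_len=3):
    """Yield every way to cut `word` into parts that all occur in `known`."""
    if word in known:
        yield [word]
    # Try each cut point that leaves at least `min_len` characters on both sides.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in known:
            for rest in candidate_splits(word[i:], known, min_len):
                yield [head] + rest


def best_split(word, counts, min_len=3):
    """Return the split whose parts have the highest geometric mean count."""
    word = word.lower()

    def score(parts):
        # Geometric mean of the corpus counts of the parts.
        return prod(counts[p] for p in parts) ** (1.0 / len(parts))

    options = list(candidate_splits(word, counts, min_len))
    if not options:
        # Nothing in the word is known, so leave it unsplit.
        return [word]
    return max(options, key=score)


if __name__ == "__main__":
    for w in ["Hausboot", "Freitag", "unbekannt"]:
        print(w, best_split(w, TOY_COUNTS))
```

With these toy counts, "hausboot" is split because its parts are far more frequent than the compound, while "freitag" stays whole because the full word is more frequent than the geometric mean of "frei" and "tag". A purely monolingual score like this can still split incorrectly, which is one reason to bring in parallel data as the abstract describes.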

Philipp Koehn | Kevin Knight

[1] Wolfgang Finkler, et al. MORPHIX: A Fast Realization of a Classification-Based Approach to Morphology, 1988.

[2] John Cocke, et al. A Statistical Approach to Machine Translation, 1990, CL.

[3] Stefan Langer, et al. Zur Morphologie und Semantik von Nominalkomposita, 1998.

[4] Martha Larson, et al. Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches, 2000, INTERSPEECH.

[5] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[6] Maarten de Rijke, et al. Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German, and Italian, 2001, CLEF.

[7] Philipp Koehn, et al. Knowledge Sources for Word-Level Translation Models, 2001, EMNLP.

[8] Turid Hedlund, et al. Utaclir @ CLEF 2001 - Effects of Compound Splitting and N-Gram Techniques, 2001, CLEF.

[9] Daniel Marcu, et al. Fast Decoding and Optimal Decoding for Machine Translation, 2001, ACL.

[10] Daniel Marcu, et al. A Phrase-Based, Joint Probability Model for Statistical Machine Translation, 2002, EMNLP.

[11] Ralf D. Brown. Corpus-driven splitting of compound words, 2002, TMI.