Alignment-guided chunking

We introduce an adaptable monolingual chunking approach–Alignment-Guided Chunking (AGC)–which makes use of knowledge of word alignments acquired from bilingual corpora. Our approach is motivated by the observation that a sentence should be chunked differently depending the foreseen end-tasks. For example, given the different requirements of translation into (say) French and German, it is inappropriate to chunk up an English string in exactly the same way as preparation for translation into one or other of these languages. We test our chunking approach on two language pairs: French–English and German–English, where these two bilingual corpora share the same English sentences. Two chunkers trained on French–English (FE-Chunker) and German–English(DE-Chunker ) respectively are used to perform chunking on the same English sentences. We construct two test sets, each suitable for French– English and German–English respectively. The performance of the two chunkers is evaluated on the appropriate test set and with one reference translation only, we report Fscores of 32.63% for the FE-Chunker and 40.41% for the DE-Chunker.

[1]  Andy Way,et al.  Robust large-scale EBMT with marker-based segmentation , 2004, TMI.

[2]  Wei Wang,et al.  Structure Alignment Using Bilingual Chunking , 2002, COLING.

[3]  Dekai Wu,et al.  Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora , 1997, CL.

[4]  Ben Taskar,et al.  Alignment by Agreement , 2006, NAACL.

[5]  Yanjun Ma,et al.  Bootstrapping Word Alignment via Word Packing , 2007, ACL.

[6]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[7]  Andy Way,et al.  MATREX: DCU machine translation system for IWSLT 2006. , 2006, IWSLT.

[8]  Yanjun Ma,et al.  MaTrEx: the DCU machine translation system for IWSLT 2007 , 2007, IWSLT.

[9]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[10]  Bo Xu,et al.  Bilingual Chunk Alignment Based on Interactional Matching and Probabilistic Latent Semantic Indexing , 2004, IJCNLP.

[11]  Thomas R. G. Green,et al.  The necessity of syntax markers: Two experiments with artificial languages , 1979 .

[12]  Sabine Buchholz,et al.  Introduction to the CoNLL-2000 Shared Task Chunking , 2000, CoNLL/LLL.

[13]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[14]  Erik F. Tjong Kim Sang,et al.  Representing Text Chunks , 1999, EACL.

[15]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[16]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[17]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[18]  Rob Koeling Chunking with Maximum Entropy Models , 2000, CoNLL/LLL.

[19]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[20]  Jörg Tiedemann Word to word alignment strategies , 2004, COLING.

[21]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[22]  Mitchell P. Marcus,et al.  Text Chunking using Transformation-Based Learning , 1995, VLC@ACL.

[23]  Walter Daelemans,et al.  Memory-Based Language Processing , 2009, Studies in natural language processing.

[24]  Andy Way,et al.  Hybrid Example-Based SMT: the Best of Both Worlds? , 2005, ParallelText@ACL.