Paper information - Pattern recognition for mapping bitext correspondence

The problem of finding token-level correspondences (bitext maps) between the two halves of a bitext can be formulated in terms of pattern recognition. From this point of view, effective solutions hinge on three tasks: signal generation, noise filtering and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR’s accuracy is consistently high for language pairs as diverse as French/English and Chinese/English. If necessary, SIMR’s bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
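To make the three-task framing concrete, here is a toy Python sketch of a bitext map as a set of points (position in text 1, position in text 2). It is not SIMR itself. The abstract does not describe SIMR's procedures, so every function name and heuristic below is an assumption chosen for illustration: exact string identity stands in for signal generation, a distance-from-diagonal cutoff (`max_dev`) for noise filtering, and a longest strictly increasing chain for search. That last choice is a swapped-in simplification that happens to yield an injective map.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (token position in text 1, token position in text 2)


def generate_candidates(src: List[str], tgt: List[str]) -> List[Point]:
    """Signal generation (toy assumption): pair tokens that are identical
    after lowercasing, e.g. numbers or proper nouns shared by both texts."""
    return [(i, j)
            for i, s in enumerate(src)
            for j, t in enumerate(tgt)
            if s.lower() == t.lower()]


def filter_noise(points: List[Point], src_len: int, tgt_len: int,
                 max_dev: float) -> List[Point]:
    """Noise filtering (toy assumption): keep only points that lie within
    max_dev target positions of the main diagonal of the bitext space."""
    if src_len == 0:
        return []
    slope = tgt_len / src_len
    return [(i, j) for i, j in points if abs(j - i * slope) <= max_dev]


def longest_monotone_chain(points: List[Point]) -> List[Point]:
    """Search (toy assumption): longest chain that strictly increases in both
    coordinates, so no two points share a position in either text."""
    pts = sorted(points)
    if not pts:
        return []
    best = [1] * len(pts)    # length of the best chain ending at each point
    prev = [-1] * len(pts)   # back-pointer to the previous point in that chain
    for b in range(len(pts)):
        for a in range(b):
            if (pts[a][0] < pts[b][0] and pts[a][1] < pts[b][1]
                    and best[a] + 1 > best[b]):
                best[b] = best[a] + 1
                prev[b] = a
    end = max(range(len(pts)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(pts[end])
        end = prev[end]
    return chain[::-1]


if __name__ == "__main__":
    src = ["Ottawa", "1997", "Canada", "GATT", "Ottawa"]
    tgt = ["Ottawa", "1997", "Canada", "GATT"]
    candidates = generate_candidates(src, tgt)
    kept = filter_noise(candidates, len(src), len(tgt), max_dev=1.5)
    print(longest_monotone_chain(kept))
```

In this toy input the second "Ottawa" in text 1 produces a spurious candidate far from the diagonal. The filter discards it before the search runs, which is the kind of interplay between the three tasks that the abstract alludes to.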

I. Dan Melamed
