Pattern recognition for mapping bitext correspondence

The problem of finding token-level correspondences (bitext maps) between the two halves of a bitext can be formulated in terms of pattern recognition. From this point of view, effective solutions hinge on three tasks: signal generation, noise filtering and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR’s accuracy is consistently high for language pairs as diverse as French/English and Chinese/English. If necessary, SIMR’s bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
