Parallel translations of written texts have long been useful tools for human students of language, and have begun to serve as an intriguing source of data for corpus-based approaches to natural language processing. A source text and its translation can be viewed as a coarse map between the two languages, and an industrious student or clever computer program may wish to refine that mapping so that it shows which sentences, phrases, and words are translations of one another. Humans are very adept at finding such relations in parallel text. This is true even when one or both of the languages are unfamiliar, as can be seen in a simple but convincing exercise in Knight (1997). While there was considerable early success in automatically identifying sentences in parallel text that are translations of each other (e.g., Brown, Lai, and Mercer, 1991; Gale and Church, 1993), a variety of challenging problems has emerged since that time. Empirical Methods for Exploiting Parallel Texts is a revision of the author's 1998 Ph.D. dissertation (University of Pennsylvania), and succeeds in capturing the range of problems inherent in parallel text. It presents a variety of techniques for finding translation equivalents and demonstrates that, once these are available, they can be used to align text segments, detect omissions in translations, identify non-compositional compounds, and discriminate among word senses.
[1] Kenneth Ward Church, et al. A Program for Aligning Sentences in Bilingual Corpora. CL, 1993.
[2] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation. CL, 1993.
[3] Robert L. Mercer, et al. Aligning Sentences in Parallel Corpora. ACL, 1991.
[4] Richard O. Duda, et al. Pattern Classification and Scene Analysis. A Wiley-Interscience publication, 1974.
[5] Kevin Knight, et al. Automating Knowledge Acquisition for Machine Translation. AI Mag., 1997.