Using statistical methods to create a bilingual dictionary

A probabilistic bilingual dictionary assigns to each possible translation a probability that indicates how likely the translation is. This master's thesis covers a method to compile a probabilistic bilingual dictionary (or bilingual lexicon) from a parallel corpus (i.e. large documents that are each other's translations). Two research questions are answered in this thesis. How can statistical methods applied to bilingual corpora be used to create a bilingual dictionary? And what can be said about the performance of the resulting dictionary in a multilingual document retrieval system? To build the dictionary, we used a statistical algorithm called the EM-algorithm. The EM-algorithm was first used to analyse parallel corpora at IBM in 1990. In this thesis we took a new approach: we developed an EM-algorithm that compiles a bi-directional dictionary. We believe that there are two good reasons to take a bi-directional approach instead of a uni-directional approach. First, a bi-directional dictionary needs less space than two uni-directional dictionaries. Secondly, we believe that a bi-directional approach leads to better estimates of the translation probabilities than the uni-directional approach. We do not yet have a theoretical proof that our symmetric EM-algorithm is correct. However, we do have preliminary results that indicate better performance of our EM-algorithm compared to the algorithm developed at IBM.
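To make the idea of EM-based dictionary estimation concrete, the following is a minimal sketch of the uni-directional case only, in the general style of the simplest IBM word-translation model. It is not the symmetric, bi-directional algorithm developed in this thesis, whose details are not given here. The function name `train_unidirectional`, the toy English-Dutch sentence pairs, and the choice to omit an empty (NULL) source word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_unidirectional(pairs, iterations=20):
    """Estimate t[e][f] = P(target word f | source word e) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Assumption: no NULL word and no alignment model, only word translation
    probabilities, as in the simplest IBM-style model.
    """
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}

    # Start from a uniform distribution over target words for every source word.
    t = {e: {f: 1.0 / len(tgt_vocab) for f in tgt_vocab} for e in src_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)

        # E-step: distribute each target word over the source words of its
        # sentence pair, in proportion to the current probabilities.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[e][f] for e in src)
                for e in src:
                    share = t[e][f] / norm
                    count[e][f] += share
                    total[e] += share

        # M-step: renormalise the expected counts into probabilities.
        t = {e: {f: c / total[e] for f, c in count[e].items()} for e in count}

    return t


# Hypothetical toy corpus (English -> Dutch), for illustration only.
toy_pairs = [
    ("the house".split(), "het huis".split()),
    ("the book".split(), "het boek".split()),
    ("a book".split(), "een boek".split()),
]

if __name__ == "__main__":
    lexicon = train_unidirectional(toy_pairs)
    for source_word, translations in sorted(lexicon.items()):
        best = max(translations, key=translations.get)
        print(source_word, "->", best, round(translations[best], 3))
```

Running the same procedure with source and target swapped would give the second uni-directional dictionary. A bi-directional approach, as argued above, aims to replace these two separate tables with a single one.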
