Paper information - Using statistical methods to create a bilingual dictionary

Using statistical methods to create a bilingual dictionary

A probabilistic bilingual dictionary assigns to each possible translation a probability that indicates how likely the translation is. This master's thesis covers a method to compile a probabilistic bilingual dictionary (or bilingual lexicon) from a parallel corpus (i.e. large documents that are each other's translations). Two research questions are answered in this thesis. First, how can statistical methods applied to bilingual corpora be used to create the bilingual dictionary? Second, what can be said about the performance of the created bilingual dictionary in a multilingual document retrieval system? To build the dictionary, we used a statistical algorithm called the EM-algorithm. The EM-algorithm was first used to analyse parallel corpora at IBM in 1990. In this thesis we took a new approach: we developed an EM-algorithm that compiles a bi-directional dictionary. We believe that there are two good reasons to take a bi-directional approach instead of a uni-directional approach. First, a bi-directional dictionary will need less space than two uni-directional dictionaries. Second, we believe that a bi-directional approach will lead to better estimates of the translation probabilities than the uni-directional approach. We do not yet have a theoretical proof that our symmetric EM-algorithm is indeed correct. However, we do have preliminary results that indicate better performance of our EM-algorithm compared to the algorithm developed at IBM.
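The abstract does not spell out the symmetric EM-algorithm, so the following is only a minimal sketch of the uni-directional baseline it is compared against: EM estimation of word translation probabilities in the style of IBM Model 1, simplified here by leaving out the empty (NULL) word. The function names `train_em` and `best_translation`, the iteration count, and the three-sentence Dutch-English toy corpus are all assumptions made for illustration and do not come from the thesis.

```python
from collections import defaultdict


def train_em(pairs, iterations=20):
    """Estimate t[(e, f)] = P(target word e | source word f) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Hypothetical helper: a uni-directional, Model 1-style sketch,
    not the bi-directional algorithm developed in the thesis.
    """
    target_vocab = {e for _, tgt in pairs for e in tgt}
    uniform = 1.0 / len(target_vocab)

    # Start from a uniform distribution over co-occurring word pairs.
    t = {}
    for src, tgt in pairs:
        for f in src:
            for e in tgt:
                t[(e, f)] = uniform

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e aligned to f
        total = defaultdict(float)  # expected count of f aligned to anything
        # E-step: distribute each target word over the source words
        # of its sentence, in proportion to the current probabilities.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalise the expected counts per source word.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the most probable target word for `source_word`."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (assumed data, for illustration only).
pairs = [
    (["het", "huis"], ["the", "house"]),
    (["het", "boek"], ["the", "book"]),
    (["een", "boek"], ["a", "book"]),
]

t = train_em(pairs)
for word in ["het", "huis", "boek", "een"]:
    print(word, "->", best_translation(t, word))
```

Running the same procedure with the two languages swapped would give the second uni-directional dictionary. The abstract's argument is that a single bi-directional estimate could replace both, saving space and possibly improving the probability estimates.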

Djoerd Hiemstra | D. Hiemstra
