Target Word Selection Using WordNet and Data-Driven Models in Machine Translation

Collocation information plays an important role in target word selection in machine translation. However, a collocation dictionary covers only a limited portion of the selection task because of data sparseness. To resolve the sparseness problem, we propose a new methodology that selects target words after determining an appropriate collocation class by using inter-word semantic similarity. We estimate the similarity by computing the semantic distance between two synsets in WordNet and the term-to-term similarity in data-driven models. In WordNet, the semantic similarity between two words can be calculated by adapting the reciprocal of the Semantic Distance (SD). For the calculation of the SD, each synset in WordNet is assigned an M-value, computed as \( \text{M-value} = \tfrac{radix}{sf^{p}} \), where radix is an initial M-value, sf is a scale factor, and p is the number of edges from the root to the synset. As the data-driven models, we utilize Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA), a probabilistic application of LSA. LSA applies singular value decomposition (SVD) to a matrix A whose rows correspond to terms. SVD is a form of factor analysis and is defined as \( A = U \Sigma V^{T} \), where \( \Sigma \) is a diagonal matrix of the singular values of A (the square roots of the r nonzero eigenvalues of \( AA^{T} \) or \( A^{T}A \)), and the columns of U and V are the orthonormal eigenvectors associated with the r nonzero eigenvalues of \( AA^{T} \) and \( A^{T}A \), respectively. The term-to-term similarity is based on the inner products between two row vectors of A, given by \( AA^{T} = U \Sigma^{2} U^{T} \). To compute the similarity of \( w_1 \) and \( w_2 \) in PLSA, \( P(z \mid w_1)P(z \mid w_2) \) must be computed approximately, with each factor derived from \( P(z \mid w) = \frac{P(z)P(w \mid z)}{\sum_{z'} P(z')P(w \mid z')} \), where z represents contexts.
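
The three quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the default values of `radix` and `sf`, and the toy inputs in the usage note are assumptions chosen for illustration, and NumPy is assumed to be available. The sketch covers only the M-value, the LSA term-to-term inner products, and the PLSA posterior; how M-values are combined into the SD, and how the per-context products are aggregated into a single PLSA similarity score, is not stated here and is left out.

```python
import numpy as np


def m_value(depth, radix=1.0, sf=2.0):
    """M-value = radix / sf**p for a synset p edges below the root.

    The defaults for radix and sf are hypothetical placeholders.
    """
    return radix / (sf ** depth)


def lsa_term_similarity(A, k=None):
    """Term-to-term inner products U diag(s^2) U^T from the SVD of A.

    Rows of A are assumed to be terms. With k=None all singular values
    are kept, so the result equals A @ A.T; a smaller k gives the
    rank-k approximation.
    """
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if k is not None:
        U, s = U[:, :k], s[:k]
    # Scale each column of U by the squared singular value, then project back.
    return (U * s ** 2) @ U.T


def plsa_posterior(p_z, p_w_given_z):
    """P(z|w) = P(z) P(w|z) / sum_z' P(z') P(w|z').

    p_z has shape (Z,), p_w_given_z has shape (Z, W).
    Returns an array of shape (W, Z) whose rows each sum to 1.
    """
    joint = p_z[:, None] * p_w_given_z        # P(z) P(w|z), shape (Z, W)
    return (joint / joint.sum(axis=0)).T      # normalize over z for each w


def plsa_context_products(posterior, w1, w2):
    """Per-context products P(z|w1) P(z|w2), one value per context z."""
    return posterior[w1] * posterior[w2]
```

As a usage example under the same assumptions, `m_value(3, radix=1.0, sf=2.0)` gives `0.125`, so M-values shrink geometrically with depth in the hierarchy. For a toy term-by-context matrix such as `np.array([[1., 0., 1.], [1., 1., 0.], [0., 1., 1.]])`, `lsa_term_similarity(A)` reproduces `A @ A.T`, which is the identity \( AA^{T} = U \Sigma^{2} U^{T} \), while passing a smaller `k` keeps only the largest singular values.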