Parallel Corpus Refinement as an Outlier Detection Algorithm

Filtering noisy parallel corpora or removing mistranslations out of training sets can improve the quality of a statistical machine translation. Discriminative methods for filtering the corpora such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to generate labeled data, produces an imbalanced training set, containing a few correct translations and thus inappropriate for training a classifier. In order to treat this problem effectively, unsupervised methods are utilized and the problem is modeled as an outlier detection procedure. The experiments show that a filtered corpus, results in an improved translation quality, even with some sentence pairs removed.