Paper Information - Parallel Corpus Refinement as an Outlier Detection Algorithm

Parallel Corpus Refinement as an Outlier Detection Algorithm

Filtering noisy parallel corpora, that is, removing mistranslations from the training set, can improve the quality of a statistical machine translation system. Discriminative methods for filtering corpora, such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to obtain labeled data produces an imbalanced training set that contains only a few correct translations and is therefore unsuitable for training a classifier. To address this problem, unsupervised methods are used and the task is modeled as outlier detection. The experiments show that a filtered corpus yields improved translation quality, even though some sentence pairs have been removed.
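The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say which features or which outlier detector are used, so the log length-ratio feature, the modified z-score (median and MAD), the 3.5 cutoff, and the function names `positive_fraction` and `filter_outliers` are all assumptions chosen for illustration.

```python
import math
from statistics import median


def positive_fraction(n):
    """Share of correct pairs when n aligned pairs are expanded to the
    n * n Cartesian product (one correct target per source sentence)."""
    return n / (n * n)


def log_length_ratio(src, tgt):
    """Hypothetical feature: log of the source/target token-count ratio."""
    return math.log(max(len(src.split()), 1) / max(len(tgt.split()), 1))


def filter_outliers(pairs, threshold=3.5):
    """Keep pairs whose feature is not an outlier under the modified
    z-score. This is an illustrative detector, not the one in the paper."""
    values = [log_length_ratio(s, t) for s, t in pairs]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against; keep everything in this sketch.
        return list(pairs)
    kept = []
    for pair, v in zip(pairs, values):
        score = 0.6745 * abs(v - med) / mad
        if score <= threshold:
            kept.append(pair)
    return kept


pairs = [
    ("the house is small", "das Haus ist klein"),
    ("i do not like green tea", "ich mag keinen grünen Tee"),
    ("she reads a book every day", "sie liest jeden Tag ein Buch"),
    ("where is the train station", "wo ist der Bahnhof"),
    ("we are going home", "wir gehen jetzt nach Hause"),
    # Misaligned pair: the target is unrelated and far longer.
    ("thank you", "bitte beachten Sie die folgenden Hinweise zur Nutzung dieser Seite"),
]

kept = filter_outliers(pairs)
print(len(pairs), "pairs in,", len(kept), "pairs kept")
print("correct-pair share for n = 1000:", positive_fraction(1000))
```

With 1,000 aligned pairs, the Cartesian product has 1,000,000 candidates of which only 1,000 are correct, which is the imbalance the abstract refers to. The outlier filter needs no labels at all, which is why an unsupervised formulation sidesteps that problem. A single length feature is far too weak for real corpora and is used here only to keep the sketch short.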

Shahram Khadivi | K. Taghipour
