Quantifying the utility of parallel corpora
Our English-Chinese cross-language information retrieval (IR) system is trained on parallel corpora. We investigate its performance as a function of training corpus size for three different training corpora. We find that the performance obtained with each of the three corpora can be related by a single simple measure: the out-of-vocabulary rate of query words.
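As an illustration of the measure named above, the following is a minimal sketch of how an out-of-vocabulary (OOV) rate of query words might be computed. It assumes a token-level definition (the fraction of query tokens that never occur in the query-language side of the training corpus) and whitespace tokenization; the abstract does not specify either choice, and the function name `oov_rate` and the toy data are hypothetical.

```python
def oov_rate(query_tokens, training_vocab):
    """Return the fraction of query tokens absent from the training vocabulary.

    Assumption: a token-level rate; a type-level rate (over distinct words)
    would be an equally plausible reading of the abstract.
    """
    tokens = list(query_tokens)
    if not tokens:
        return 0.0  # convention chosen here for an empty query set
    missing = sum(1 for t in tokens if t not in training_vocab)
    return missing / len(tokens)


# Hypothetical toy data standing in for one side of a parallel corpus.
training_sentences = ["the trade agreement was signed", "the economy grew"]
vocab = {w for s in training_sentences for w in s.lower().split()}

# Hypothetical queries in the same language as the training sentences.
queries = ["trade agreement tariffs", "economy inflation"]
query_tokens = [w for q in queries for w in q.lower().split()]

rate = oov_rate(query_tokens, vocab)
print(f"OOV rate: {rate:.2f}")
```

Recomputing this rate for growing prefixes of each training corpus would give one value per corpus size, which is the kind of quantity the abstract relates to retrieval performance.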