The cultivation of a Chinese-English-Japanese trilingual parallel corpus from comparable patents

Many NLP applications, from machine translation (MT) to cross-lingual information retrieval, require parallel corpora as critical resources. Given the phenomenal growth in patents and the growing need to mediate between different languages, we explore a new but important area involving patents: we investigate how a Chinese-English-Japanese trilingual parallel corpus can be cultivated from comparable patents. We also introduce our mined trilingual corpus, which demonstrates the considerable potential of cultivating large-scale parallel corpora from comparable patents.
