Acquiring Paraphrases from Corpora and Its Application to Machine Translation

Natural language contains many paraphrases, that is, superficially different expressions that share the same meaning. This variety reflects the rich expressiveness of natural language, but it also causes difficulty in natural language processing applications such as machine translation (MT). For MT, it reduces the coverage of translatable input sentences and makes the language too complex for a system to comprehend every possible variation. Unfortunately, existing paraphrase resources do not adequately address this difficulty: their paraphrase knowledge covers only general areas and is of little use for specific domains and applications.

This thesis describes corpus-based paraphrase acquisition and its application to MT. We propose two paraphrase acquisition methods, one for lexical paraphrases and one for sentential paraphrases, each of which has its own advantages. Both methods are based on shallow analysis and rely on a corpus but on no other resource.

The achievements described in this thesis consist of three parts: analysis of manual paraphrases, automatic acquisition of lexical paraphrases, and similar sentence retrieval, which corresponds to sentential paraphrasing. First, we describe two analyses of human paraphrases to answer the following questions: (1) what types of paraphrases are dominant? and (2) how can human paraphrases be effective for MT? These investigations suggest that lexical paraphrasing and sentential paraphrasing are dominant in the travel conversation domain. Second, we describe a method for extracting lexical paraphrases from a parallel corpus. This method has two advantages: (1) it acquires not only synonymous content

∗Doctoral Dissertation, Department of Information Processing, Graduate School of Information Science, Nara Institute of Science and Technology, NAIST-IS-DD0261014, September 15, 2004.
