Idiomatic Expression Paraphrasing without Strong Supervision

Idiomatic expressions (IEs) play an essential role in natural language. In this paper, we study the task of idiomatic sentence paraphrasing (ISP), which aims to paraphrase a sentence containing an IE by replacing the IE with its literal paraphrase. The primary challenge for this task is the lack of large-scale corpora of idiomatic-literal parallel sentences, for which we consider two separate solutions. First, we propose an unsupervised approach to ISP that leverages an IE's contextual information and definition and does not require a parallel sentence training set. Second, we propose a weakly supervised approach that uses back-translation to jointly perform paraphrasing and generation of sentences with IEs, thereby enlarging the small-scale parallel sentence training dataset. Other significant derivatives of the study include a model that replaces a literal phrase in a sentence with an IE to generate an idiomatic sentence, and a large-scale parallel dataset of idiomatic/literal sentence pairs. Compared to competitive baselines, the proposed solutions achieve relative gains of over 5.16 points in BLEU, over 8.75 points in METEOR, and over 19.57 points in SARI when the generated sentences are empirically validated on a parallel dataset using automatic and manual evaluations. We also demonstrate the practical utility of ISP as a preprocessing step in En-De machine translation.
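To make the ISP input and output concrete, the following is a minimal sketch of the task's interface only. It assumes a hypothetical hand-written lookup table (`IDIOM_DEFINITIONS`) and a hypothetical helper (`paraphrase_idiom`); it is not the unsupervised or weakly supervised model described above, and it ignores the contextual information those approaches rely on.

```python
import re
from typing import Mapping

# Hypothetical idiom-to-literal lookup; the entries are illustrative only
# and are not taken from the paper's dataset.
IDIOM_DEFINITIONS = {
    "kick the bucket": "die",
    "under the weather": "slightly ill",
}


def paraphrase_idiom(sentence: str,
                     definitions: Mapping[str, str] = IDIOM_DEFINITIONS) -> str:
    """Replace each known idiom in `sentence` with its literal paraphrase.

    This is a plain string substitution: it does not handle inflected
    forms (e.g. "kicked the bucket") or decide whether a phrase is used
    idiomatically or literally in context.
    """
    for idiom, literal in definitions.items():
        # Match the idiom as a whole phrase, case-insensitively.
        pattern = re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE)
        # A function replacement avoids interpreting backslashes in `literal`.
        sentence = pattern.sub(lambda match, lit=literal: lit, sentence)
    return sentence


if __name__ == "__main__":
    print(paraphrase_idiom("She felt under the weather today."))
```

A lookup of this kind breaks down as soon as the idiom is inflected or used literally, which is one reason a context-aware approach is of interest for this task.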
