Unsupervised Evaluation Metrics and Learning Criteria for Non-Parallel Textual Transfer

We consider the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data (Hu et al., 2017; Shen et al., 2017). This setting poses challenges for evaluation. We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and fluency, as well as a way to combine them into a single overall score. We contribute new loss functions and training strategies to address the different metrics. Semantic preservation is addressed by adding a cyclic consistency loss and a loss based on paraphrase pairs, while fluency is improved by integrating losses based on style-specific language models. We experiment with a Yelp sentiment dataset and a new literature dataset that we propose, using multiple models that extend prior work (Shen et al., 2017). We demonstrate that our metrics correlate well with human judgments at both the sentence level and the system level. Automatic and manual evaluations also show large improvements over the baseline method of Shen et al. (2017). We hope that our proposed metrics can speed up system development for new textual transfer tasks while also encouraging the community to address our three complementary aspects of transfer quality.

[1]  Lantao Yu,et al.  SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient , 2016, AAAI.

[2]  Wei Chen,et al.  Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets , 2017, NAACL.

[3]  Rico Sennrich,et al.  Controlling Politeness in Neural Machine Translation via Side Constraints , 2016, NAACL.

[4]  Yoav Goldberg,et al.  Controlling Linguistic Style Aspects in Neural Language Generation , 2017, ArXiv.

[5]  Yulia Tsvetkov,et al.  Style Transfer Through Back-Translation , 2018, ACL.

[6]  Xu Sun,et al.  Learning Sentiment Memories for Sentiment Modification without Parallel Data , 2018, EMNLP.

[7]  Eric P. Xing,et al.  Unsupervised Text Style Transfer using Language Models as Discriminators , 2018, NeurIPS.

[8]  Guillaume Lample,et al.  Unsupervised Machine Translation Using Monolingual Corpora Only , 2017, ICLR.

[9]  Alan Ritter,et al.  Data-Driven Response Generation in Social Media , 2011, EMNLP.

[10]  Regina Barzilay,et al.  Style Transfer from Non-Parallel Text by Cross-Alignment , 2017, NIPS.

[11]  Eneko Agirre,et al.  Unsupervised Neural Machine Translation , 2017, ICLR.

[12]  Cícero Nogueira dos Santos,et al.  Fighting Offensive Language on Social Media with Unsupervised Text Style Transfer , 2018, ACL.

[13]  Samy Bengio,et al.  Content preserving text generation with attribute controls , 2018, NeurIPS.

[14]  Alan Ritter,et al.  Adversarial Learning for Neural Dialogue Generation , 2017, EMNLP.

[15]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[17]  Eric P. Xing,et al.  Toward Controlled Generation of Text , 2017, ICML.

[18]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[19]  Iyad Rahwan,et al.  Evaluating Style Transfer for Text , 2019, NAACL.

[20]  Houfeng Wang,et al.  Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach , 2018, ACL.

[21]  Katharina Kann,et al.  Sentence-Level Fluency Evaluation: References Help, But Can Be Spared! , 2018, CoNLL.

[22]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[23]  Dongyan Zhao,et al.  Style Transfer in Text: Exploration and Evaluation , 2017, AAAI.

[24]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[25]  Richard H. R. Hahnloser,et al.  Large-scale Hierarchical Alignment for Author Style Transfer , 2018, ArXiv.

[26]  Bernt Schiele,et al.  $A^{4}NT$: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation , 2017, arXiv:1711.01921.

[27]  Michael Gamon,et al.  Sentence-level MT evaluation without reference translations: beyond language modeling , 2005, EAMT.

[28]  Joel R. Tetreault,et al.  Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer , 2018, NAACL.

[29]  Trung Le,et al.  Dual Discriminator Generative Adversarial Nets , 2017, NIPS.

[30]  Percy Liang,et al.  Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer , 2018, NAACL.

[31]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[32]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[33]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[34]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[35]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, ICCV.

[36]  Kevin Gimpel,et al.  Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations , 2017, ArXiv.

[37]  Zhe Gan,et al.  Adversarial Text Generation via Feature-Mover's Distance , 2018, NeurIPS.

[38]  拓海 杉山,et al.  Study Report on "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks" , 2017.

[39]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.