Visual Paraphrase Generation with Key Information Retained

The visual paraphrase generation task aims to rewrite a given image-related original sentence into a new paraphrase that expresses the same meaning as the original sentence but differs in its form of expression. Existing studies mainly extract two semantic vectors, one representing the entire image and one representing the entire original sentence, for paraphrase generation. However, such global semantic vectors may cause the model to overlook key objects in the original sentence, so that it generates semantically inconsistent sentences by changing key object information. In this paper, we propose an object-level paraphrase generation model, which generates paraphrases by adjusting the permutation of key objects and modifying their associated descriptions. To adjust the permutation of key objects, an object sorting module produces a new object sequence based on the key object information and the original sentence. Then, a sequence generation module generates the paraphrase sequentially, following the order of the new object sequence. Each generation step focuses on the image features associated with a different key object, to generate descriptions that differ in expression. Furthermore, we use a semantic discriminator module to encourage the generated paraphrase to be semantically close to the original sentence. Specifically, the loss function of the discriminator penalizes excessive distance between the paraphrase and the original sentence. Extensive experiments on the MS COCO dataset show that the proposed model outperforms the baselines.
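
The abstract does not specify how the discriminator measures distance or what counts as excessive. As a minimal sketch of the general idea only, the following assumes cosine distance between sentence embeddings and a hinge-style penalty with a hypothetical `margin` parameter; the function names, the choice of distance, and the margin value are illustrative assumptions, not the authors' formulation.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_penalty(paraphrase_vec, original_vec, margin=0.2):
    """Hinge-style penalty: zero while the paraphrase embedding stays within
    `margin` of the original, growing linearly once it drifts further.

    `margin` is a hypothetical hyperparameter chosen for illustration.
    """
    return max(0.0, cosine_distance(paraphrase_vec, original_vec) - margin)


# Toy embeddings (assumed values, not real model outputs).
original = [1.0, 0.0]
close_paraphrase = [1.0, 0.0]
distant_paraphrase = [0.0, 1.0]

print(distance_penalty(close_paraphrase, original))    # no penalty
print(distance_penalty(distant_paraphrase, original))  # penalized
```

Under these assumptions, a paraphrase is free to vary its wording as long as its embedding stays within the margin, and is penalized only once it drifts semantically beyond it.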
