News Content Completion with Location-Aware Image Selection

News, one of the fundamental types of social media content, typically contains both text and images. Image selection, which involves choosing appropriate images for a specified context, is crucial for composing good news articles. However, it presents two challenges: where to place images and which images to use. The difficulty of this where-which problem lies in the fact that a news article typically contains more than one image along with linguistically rich text that conveys complex information. In this paper, we propose a novel end-to-end two-stage framework to address both challenges. In the first stage, we identify key information in the news by using location embeddings, which represent the local contextual information of each candidate location for image insertion. In the second stage, we examine the candidate images and select the most context-related ones to insert at each location identified in the first stage. We also introduce three insertion strategies to model different scenarios that influence the image selection procedure. Extensive experiments demonstrate the consistent superiority of the proposed framework in image selection.
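To make the where-which decomposition concrete, the following is a minimal sketch of a two-stage pipeline, not the framework proposed in the paper. It assumes sentence and image vectors have already been computed in a shared space, builds each location embedding as the mean of the neighbouring sentence vectors, and uses cosine similarity as a stand-in for the learned scoring in both stages. All function names, the `window` and `k` parameters, and the toy vectors are hypothetical, and the three insertion strategies are not modelled.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def location_embeddings(sentence_vecs, window=1):
    """Assumed location embedding: location i is the gap between sentence i
    and sentence i + 1, represented by the mean of `window` sentences on
    each side of the gap."""
    n = len(sentence_vecs)
    locs = []
    for i in range(n - 1):
        lo = max(0, i - window + 1)
        hi = min(n, i + 1 + window)
        locs.append(mean_vec(sentence_vecs[lo:hi]))
    return locs


def select_locations(loc_vecs, image_vecs, k):
    """Stage 1 stand-in ("where"): keep the k locations whose local context
    best matches any candidate image."""
    scores = [max(cosine(loc, img) for img in image_vecs) for loc in loc_vecs]
    ranked = sorted(range(len(loc_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_images(loc_vecs, chosen, image_vecs):
    """Stage 2 stand-in ("which"): greedily give each chosen location its
    most similar image that has not been used yet."""
    used, placement = set(), {}
    for loc in chosen:
        unused = [j for j in range(len(image_vecs)) if j not in used]
        if not unused:
            break  # more locations than images; leave the rest empty
        best = max(unused, key=lambda j: cosine(loc_vecs[loc], image_vecs[j]))
        used.add(best)
        placement[loc] = best
    return placement


# Toy data: five sentence vectors and two image vectors (made up).
sentences = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
images = [[0, 1, 0], [1, 0, 0]]

loc_vecs = location_embeddings(sentences, window=1)
chosen = select_locations(loc_vecs, images, k=2)
placement = select_images(loc_vecs, chosen, images)
print(placement)  # {0: 1, 3: 0}
```

In this toy run, image 1 is placed in the gap after the first sentence and image 0 in the gap after the fourth, because those are the locations whose surrounding sentences align most closely with each image. A learned model would replace both the averaging and the cosine scoring.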
