Upgrading the Newsroom: An Automated Image Selection System for News Articles

We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with char-level word embeddings to help both modeling morphologically rich languages, e.g. German, and transferring knowledge across nearby languages. The text encoder adopts a hierarchical self-attention mechanism to attend more to both keywords within a piece of text and informative components of a news article. We extensively experiment with our system on a large-scale text-image database containing multimodal multilingual news articles collected from Swiss local news media websites. The system is compared with multiple baselines with ablation studies and is shown to beat existing text-image retrieval methods in a weakly-supervised learning setting. Besides, we also offer insights on the advantage of using multiple textual sources and multilingual data.

[1]  Hervé Jégou,et al.  Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion , 2018, EMNLP.

[2]  Alexander M. Rush,et al.  The Annotated Transformer , 2018 .

[3]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[4]  Larry P. Heck,et al.  Learning deep structured semantic models for web search using clickthrough data , 2013, CIKM.

[5]  Alexander J. Smola,et al.  Sampling Matters in Deep Embedding Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[7]  Shie Mannor,et al.  Is a picture worth a thousand words? A Deep Multi-Modal Fusion Architecture for Product Classification in e-commerce , 2016, AAAI 2016.

[8]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[9]  Guillaume Lample,et al.  Unsupervised Machine Translation Using Monolingual Corpora Only , 2017, ICLR.

[10]  Fangyu Liu,et al.  A Strong and Robust Baseline for Text-Image Matching , 2019, ACL.

[11]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[12]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[13]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[14]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[15]  Svetlana Lazebnik,et al.  Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections , 2014, ECCV.

[16]  Roi Reichart,et al.  Bridging Languages through Images with Deep Partial Canonical Correlation Analysis , 2018, ACL.

[17]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[18]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[19]  Frank Keller,et al.  Image Pivoting for Learning Multilingual Multimodal Representations , 2017, EMNLP.

[20]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[21]  Ronan Collobert,et al.  Phrase-based Image Captioning , 2015, ICML.

[22]  Xun Wang,et al.  Improved Text-Image Matching by Mitigating Visual Semantic Hubs , 2019 .

[23]  Dimitris N. Metaxas,et al.  Show Me a Story: Towards Coherent Neural Story Illustration , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Yang Feng,et al.  Unsupervised Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.

[27]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[28]  Yansong Feng,et al.  Topic Models for Image Annotation and Text Illustration , 2010, HLT-NAACL.

[29]  Rodrigo C. Barros,et al.  Bidirectional Retrieval Made Simple , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[32]  Ruihua Song,et al.  Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences , 2019, ACM Multimedia.

[33]  Qi Tian,et al.  Joint Global and Co-Attentive Representation Learning for Image-Sentence Retrieval , 2018, ACM Multimedia.

[34]  Guillaume Lample,et al.  Word Translation Without Parallel Data , 2017, ICLR.

[35]  Yanti Idaya Aspura Mohd Khalid,et al.  A framework for integrating DBpedia in a multi-modality ontology news image retrieval system , 2011, 2011 International Conference on Semantic Technology and Information Retrieval.

[36]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[37]  Ruslan Salakhutdinov,et al.  Learning Robust Visual-Semantic Embeddings , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Arnau Ramisa,et al.  Multimodal News Article Analysis , 2017, IJCAI.

[39]  Yansong Feng,et al.  How Many Words Is a Picture Worth? Automatic Caption Generation for News Images , 2010, ACL.

[40]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[41]  Prakhar Gupta,et al.  Learning Word Vectors for 157 Languages , 2018, LREC.

[42]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[43]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[44]  Wei-Ying Ma,et al.  Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[46]  Yun Fu,et al.  Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[47]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[48]  Qingming Huang,et al.  Learning Fragment Self-Attention Embeddings for Image-Text Matching , 2019, ACM Multimedia.

[49]  Graham W. Taylor,et al.  Improved Regularization of Convolutional Neural Networks with Cutout , 2017, ArXiv.

[50]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[51]  Masakazu Iwamura,et al.  ShakeDrop regularization , 2018, ICLR.

[52]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[53]  Andrew Zisserman,et al.  On-the-fly learning for visual search of large-scale image and video datasets , 2015, International Journal of Multimedia Information Retrieval.

[54]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Balaraman Ravindran,et al.  Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning , 2015, NAACL.

[56]  Brent J. Hecht,et al.  VizByWiki: Mining Data Visualizations from the Web to Enrich News Articles , 2018, WWW.

[57]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[58]  Martin Engilberge,et al.  Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[61]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[62]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Alex Graves,et al.  Neural Machine Translation in Linear Time , 2016, ArXiv.

[64]  Qin Jin,et al.  Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data , 2019, AAAI.

[65]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[66]  Martin Wattenberg,et al.  Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation , 2016, TACL.