TIRA in Baidu Image Advertising

Since an image can be perceived by customers in a few seconds, it is an effective medium for advertising and is favored by advertisers. Baidu, one of the leading search companies in the world, receives billions of text queries per day. Retrieving attractive images that capture customers' attention is therefore the core task of Baidu image advertising. Traditionally, query-to-image search is tackled by matching the text query against the image title. However, title-based image search relies on high-quality image titles, which are difficult to obtain and in some cases unavailable. A more reliable solution is to understand the image content and conduct content-based query-to-image retrieval. In this paper, we introduce a text-image cross-modal retrieval for advertising (TIRA) model, which has been launched in Baidu image advertising. The proposed TIRA is built upon the widely used image classification model ResNet and the state-of-the-art NLP model BERT, and it bridges the modality gap by mapping images and texts into the same feature space. Meanwhile, we propose to train the TIRA model with a contrastive loss, which consistently outperforms existing methods based on pairwise or triplet losses. Since the proposed TIRA model directly conducts content-based query-to-image and image-to-query retrieval and does not rely on high-quality labeled titles, it significantly enhances search flexibility. TIRA has been deployed in the image2X and query2X frameworks of Baidu image advertising. Since its launch, it has achieved considerable improvements in click-through rate (CTR) and cost per mille (CPM), which brings a considerable revenue increase for advertisers.
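The abstract does not spell out the implementation, but the stated ingredients (a ResNet image encoder, a BERT text encoder, a shared embedding space, and a contrastive training objective) suggest a two-tower design. The following is a minimal PyTorch sketch of that setup, assuming an in-batch-negatives InfoNCE-style contrastive loss; the class and function names, the projection dimension, and the choice of pretrained checkpoints are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import BertModel

class TwoTowerRetrievalModel(nn.Module):
    """Hypothetical two-tower encoder: ResNet for images, BERT for text
    queries, both projected into a shared L2-normalized embedding space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()            # drop the classification head
        self.image_encoder = backbone          # emits 2048-d pooled features
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")
        self.image_proj = nn.Linear(2048, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        out = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask)
        txt = F.normalize(self.text_proj(out.pooler_output), dim=-1)
        return img, txt

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives: each query's paired image
    is the positive; every other image in the batch is a negative."""
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +         # image -> text
            F.cross_entropy(logits.t(), targets)) / 2  # text -> image
```

One practical appeal of such a two-tower design is that the two encoders are decoupled at serving time: image embeddings can be precomputed offline and indexed, so an incoming query only needs one BERT forward pass followed by an approximate nearest-neighbor lookup.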
