论文信息 - Images Don't Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank

Images Don't Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank

Search is at the heart of modern e-commerce. As a result, the task of ranking search results automatically (learning to rank) is a multibillion dollar machine learning problem. Traditional models optimize over a few hand-constructed features based on the item's text. In this paper, we introduce a multimodal learning to rank model that combines these traditional features with visual semantic features transferred from a deep convolutional neural network. In a large scale experiment using data from the online marketplace Etsy, we verify that moving to a multimodal representation significantly improves ranking quality. We show how image features can capture fine-grained style information not available in a text-only representation. In addition, we show concrete examples of how image information can successfully disentangle pairs of highly different items that are ranked similarly by a text-only model.

[1] Cordelia Schmid,et al. Multimodal semi-supervised learning for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3] Ruslan Salakhutdinov,et al. Multimodal Neural Language Models , 2014, ICML.

[4] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[5] Jason Weston,et al. Large scale image annotation: learning to rank with joint word-image embeddings , 2010, Machine Learning.

[6] Ivan Laptev,et al. Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[7] Yanjun Qi,et al. Learning to rank with (a lot of) word features , 2010, Information Retrieval.

[8] Hang Li,et al. A Short Introduction to Learning to Rank , 2011, IEICE Trans. Inf. Syst..

[9] Barbara Caputo,et al. Learning Categories From Few Examples With Multi Model Knowledge Transfer , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Marc'Aurelio Ranzato,et al. DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[11] James Parker,et al. on Knowledge and Data Engineering, , 1990 .

[12] Thore Graepel,et al. Large Margin Rank Boundaries for Ordinal Regression , 2000 .

[13] Kilian Q. Weinberger,et al. Feature hashing for large scale multitask learning , 2009, ICML '09.

[14] Jaana Kekäläinen,et al. Cumulated gain-based evaluation of IR techniques , 2002, TOIS.

[15] Andrew Zisserman,et al. Tabula rasa: Model transfer for object category detection , 2011, 2011 International Conference on Computer Vision.

[16] Rob Fergus,et al. Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[17] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[18] Stefan Carlsson,et al. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[19] Andrew Zisserman,et al. Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[20] Nuno Vasconcelos,et al. On the regularization of image semantics by modal expansion , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21] Leon A. Gatys,et al. Texture Synthesis Using Convolutional Neural Networks , 2015, NIPS.

[22] Armand Joulin,et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[23] Leon A. Gatys,et al. Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks , 2015, ArXiv.

[24] Gang Wang,et al. Building text features for object image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Jüri Lember,et al. Bridging Viterbi and posterior decoding: a generalized risk approach to hidden path inference based on hidden Markov models , 2014, J. Mach. Learn. Res..

[26] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[27] Filip Radlinski,et al. Minimally Invasive Randomization for Collecting Unbiased Preferences from Clickthrough Logs , 2006, AAAI 2006.

[28] Thorsten Joachims,et al. Optimizing search engines using clickthrough data , 2002, KDD.

[29] Svetlana Lazebnik,et al. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections , 2014, ECCV.

[30] Qiang Yang,et al. A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[31] Partha Pratim Talukdar,et al. Improving Product Classification Using Images , 2011, 2011 IEEE 11th International Conference on Data Mining.

[32] Gregory N. Hullender,et al. Learning to rank using gradient descent , 2005, ICML.