Paper information - Capturing Deep Correlations with 2-Way Nets

We present a novel bi-directional mapping deep neural network architecture for the task of matching vectors from two data sources. Our approach employs tied neural network channels to project two views into a common, maximally correlated space using the Euclidean loss. To make both projections maximally correlating, we built an encoder-decoder framework composed of two parallel networks and incorporated batch-normalization layers and dropout adapted to the model at hand. We show state-of-the-art results on a number of computer vision tasks, including MNIST image matching and sentence-image matching on the Flickr8k and Flickr30k datasets.
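The abstract does not spell out why a Euclidean loss can serve correlation maximization. One standard identity that makes the connection plausible is sketched below, under the assumption that each projected coordinate is standardized to zero mean and unit variance (roughly what a batch-normalization layer does): the mean squared distance between two standardized signals then equals 2(1 - rho), where rho is their Pearson correlation. The helper names and the synthetic data are illustrative and are not taken from the paper.

```python
import math
import random


def standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / len(a)


def mean_squared_distance(a, b):
    """Average squared difference, i.e. a per-coordinate Euclidean loss."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Synthetic stand-ins for one coordinate of the two projected views
# (hypothetical data, chosen only so that the two signals are correlated).
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
y = [0.8 * xi + 0.6 * random.gauss(0.0, 1.0) for xi in x]

rho = pearson(x, y)
msd = mean_squared_distance(standardize(x), standardize(y))

# For standardized signals the two quantities are tied exactly:
# msd == 2 * (1 - rho), so lowering the Euclidean loss raises rho.
print(abs(msd - 2.0 * (1.0 - rho)) < 1e-9)
```

Under this assumption, driving the Euclidean loss toward zero is the same as driving the correlation toward one, which is consistent with the abstract's description of a maximally correlated common space learned with a Euclidean loss.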

Aviv Eisenschtat | Lior Wolf
