Cross-modality earth mover’s distance-driven convolutional neural network for different-modality data

Cross-modality matching is the problem of measuring the similarity or dissimilarity of a pair of data points from different modalities, such as an image and a text. Deep neural networks are widely used to represent data points of different modalities because of their ability to extract effective features. However, existing works compare the deep features of different modalities with simple distance metrics, which do not fit the nature of cross-modality matching: they force the features of different modalities to have the same dimension and do not allow cross-feature matching. To solve this problem, we propose to use convolutional neural network (CNN) models with a soft-max activation layer to map a pair of different-modality data points to two histograms (not necessarily of the same dimension) and to measure their dissimilarity with the earth mover’s distance (EMD). The EMD can freely match the features extracted by the two CNN models of the different modalities. Moreover, we develop a joint learning framework that learns the CNN parameters specifically for the EMD-driven comparison, supervised by the relevance/irrelevance labels of the data pairs of different modalities. Experiments on applications such as image–text retrieval and malware detection show its advantage over existing cross-modality matching methods.
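
As an illustration of the comparison step only, the following minimal sketch computes the EMD between two soft-max histograms of different dimensions by solving the underlying transportation problem as a linear program. It is not the authors' implementation: the function names `softmax` and `emd`, the example logits, and the ground-distance matrix `cost` are all assumptions made for this example, and the abstract does not say how the ground distance is chosen or which solver is used.

```python
import numpy as np
from scipy.optimize import linprog


def softmax(z):
    """Map a vector of scores to a histogram (non-negative, sums to 1)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def emd(p, q, cost):
    """Earth mover's distance between histograms p (length m) and q (length n).

    cost[i, j] is the ground distance between bin i of p and bin j of q.
    Returns the minimal transport cost and the optimal flow matrix (m x n).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cost = np.asarray(cost, dtype=float)
    m, n = len(p), len(q)

    # Flow variables f[i, j] are flattened row-major into a vector of length m*n.
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):
        a_eq[i, i * n:(i + 1) * n] = 1.0  # mass leaving bin i of p equals p[i]
    for j in range(n):
        a_eq[m + j, j::n] = 1.0           # mass arriving at bin j of q equals q[j]
    b_eq = np.concatenate([p, q])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(m, n)


# Hypothetical outputs of two modality-specific networks (3 and 4 bins).
p = softmax([1.0, 2.0, 0.5])
q = softmax([0.1, 0.3, 2.0, 1.0])

# Assumed ground distance: bins placed evenly on [0, 1] in each modality.
cost = np.abs(np.linspace(0, 1, 3)[:, None] - np.linspace(0, 1, 4)[None, :])

distance, flow = emd(p, q, cost)
print(distance)
```

Because the flow matrix is rectangular, any bin of one histogram may send mass to any bin of the other, which is what allows the two representations to differ in dimension. In a joint learning setting the distance would additionally have to be differentiated with respect to the network outputs, which this sketch does not attempt.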
