Learning Multimodal Neural Network with Ranking Examples

To support cross-modal information retrieval, cross-modal learning-to-rank approaches use ranking examples (e.g., a text query together with its ranked list of images) to learn an appropriate ranking (similarity) function. However, the fact that each modality is represented by intrinsically different low-level features hinders these approaches from reducing the heterogeneity gap between modalities, and thus from delivering satisfactory retrieval results. In this paper, we consider learning with neural networks from the perspective of optimizing a listwise ranking loss over cross-modal ranking examples. The proposed model, named Cross-Modal Ranking Neural Network (CMRNN), benefits from advances both in neural networks for learning high-level semantics and in learning-to-rank techniques for learning ranking functions, so that the learned cross-modal ranking function is implicitly embedded in the learned high-level representations of data objects from different modalities (e.g., text and imagery), allowing cross-modal retrieval to be performed directly. We compare CMRNN with state-of-the-art cross-modal ranking methods on two datasets and show that it achieves better performance.
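
To make the listwise cross-modal setup concrete, the following is a minimal sketch, not the authors' CMRNN: it assumes two small modality-specific networks projecting text and image features into a shared space, trained with a ListNet-style top-1 cross-entropy loss over each ranking example. All layer sizes, feature dimensions, and names (ModalityNet, listwise_loss) are illustrative assumptions.

    # Minimal sketch of listwise cross-modal ranking (assumed architecture,
    # not the paper's exact CMRNN model).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityNet(nn.Module):
        """Maps raw low-level features of one modality to a shared embedding."""
        def __init__(self, in_dim, hid_dim=256, out_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hid_dim), nn.ReLU(),
                nn.Linear(hid_dim, out_dim),
            )
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)  # unit-length embeddings

    def listwise_loss(query_emb, doc_embs, relevance):
        """ListNet top-1 cross entropy between score and relevance distributions.

        query_emb: (d,) embedding of the text query
        doc_embs:  (n, d) embeddings of the candidate images
        relevance: (n,) graded relevance labels for this ranking example
        """
        scores = doc_embs @ query_emb              # similarity = dot product
        p_scores = F.softmax(scores, dim=0)        # model's top-1 distribution
        p_labels = F.softmax(relevance, dim=0)     # target top-1 distribution
        return -(p_labels * torch.log(p_scores + 1e-12)).sum()

    # Toy usage with assumed feature dimensions (300-d text, 4096-d image).
    text_net, image_net = ModalityNet(300), ModalityNet(4096)
    params = list(text_net.parameters()) + list(image_net.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)

    query = torch.randn(300)                       # one text query
    images = torch.randn(5, 4096)                  # its 5 candidate images
    labels = torch.tensor([3., 2., 0., 1., 0.])    # graded relevances

    loss = listwise_loss(text_net(query), image_net(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After training, retrieval would proceed directly in the shared space: embed a text query with the text network, embed candidate images with the image network, and rank by dot-product similarity, which is the sense in which the ranking function is embedded in the learned representations.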