Mapping Acoustic Vector Space and Document Vector Space by RNN-LSTM

In this research, we propose a method for retrieval across different media (cross-media mapping) using deep learning, a family of machine learning techniques that has developed rapidly in recent years. The network is a recurrent neural network (RNN) with long short-term memory (LSTM) units. The proposed method correlates music with lyrics, so that music can be retrieved from documents. Applying this model would make it possible to build a music suggestion system that monitors human-to-human dialogue and supplies appropriate background music (BGM). In this paper, we construct the proposed model, conduct an evaluation experiment, and confirm the feasibility of cross-media mapping.
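The following is a minimal sketch of how such an acoustic-to-document mapping might look, assuming PyTorch, 12-dimensional chroma frames as the acoustic input, a 200-dimensional document (e.g., doc2vec-style) vector space for the lyrics, and a cosine embedding loss. None of these dimensions, losses, or hyperparameters are specified in the abstract; they are illustrative assumptions only.

```python
# Hypothetical sketch: an LSTM encodes a sequence of acoustic feature frames
# (assumed here to be 12-dim chroma vectors) and projects its final hidden
# state into a fixed-size document vector space (assumed 200-dim) in which
# the song's lyrics are embedded. All sizes and the loss are assumptions.
import torch
import torch.nn as nn

class AcousticToDocMapper(nn.Module):
    def __init__(self, acoustic_dim=12, hidden_dim=128, doc_dim=200):
        super().__init__()
        self.lstm = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, doc_dim)

    def forward(self, frames):
        # frames: (batch, time, acoustic_dim)
        _, (h_n, _) = self.lstm(frames)   # h_n: (1, batch, hidden_dim)
        return self.proj(h_n.squeeze(0))  # (batch, doc_dim)

model = AcousticToDocMapper()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Cosine embedding loss pulls each track's predicted vector toward the
# document vector of its own lyrics (target = +1 for matching pairs).
criterion = nn.CosineEmbeddingLoss()

# Toy batch: 8 tracks, 500 frames each, paired with 200-dim lyric vectors.
frames = torch.randn(8, 500, 12)
lyric_vecs = torch.randn(8, 200)
pred = model(frames)
loss = criterion(pred, lyric_vecs, torch.ones(8))
loss.backward()
optimizer.step()
```

Once the two spaces are aligned this way, retrieval reduces to a nearest-neighbor search: a query document (for example, a dialogue transcript) is embedded in the same document vector space, and the tracks whose mapped acoustic vectors lie closest by cosine similarity are suggested.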
