Listen, Read, and Identify: Multimodal Singing Language Identification of Music

We propose a multimodal singing language classification model that uses both audio content and textual metadata. LRID-Net, the proposed model, takes an audio signal and a language probability vector estimated from the metadata, and outputs the probabilities of the target languages. Optionally, LRID-Net can be equipped with modality dropout to handle missing modalities. In the experiment, we trained several LRID-Nets with varying modality dropout configurations and tested them with various combinations of input modalities. The experimental results demonstrate that using multimodal input improves performance. The results also suggest that adopting modality dropout does not degrade performance when all modalities are present, while enabling the model to handle missing-modality cases to some extent.
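
To illustrate the idea of a two-branch model with modality dropout described above, here is a minimal sketch (not the authors' implementation): the class name, layer sizes, spectrogram shape, and dropout probabilities are all assumptions chosen for illustration. An audio embedding and a metadata-derived language-probability embedding are each randomly zeroed out during training, so the classifier learns to cope when one modality is missing.

```python
# Minimal sketch of a multimodal classifier with modality dropout (assumed names/shapes).
import torch
import torch.nn as nn


class MultimodalLanguageClassifier(nn.Module):
    def __init__(self, n_languages=10, p_drop_audio=0.3, p_drop_meta=0.3):
        super().__init__()
        # Audio branch: small CNN over (batch, 1, n_mels, time) spectrograms.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (batch, 32, 1, 1)
            nn.Flatten(),
        )
        # Metadata branch: language-probability vector passed through an MLP.
        self.meta_branch = nn.Sequential(
            nn.Linear(n_languages, 32),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(32 + 32, n_languages)
        self.p_drop_audio = p_drop_audio
        self.p_drop_meta = p_drop_meta

    def forward(self, spec, meta_probs):
        a = self.audio_branch(spec)       # (batch, 32)
        m = self.meta_branch(meta_probs)  # (batch, 32)
        if self.training:
            # Modality dropout: zero an entire modality embedding per sample,
            # forcing the model to predict from whichever input remains.
            keep_a = (torch.rand(a.size(0), 1, device=a.device) > self.p_drop_audio).float()
            keep_m = (torch.rand(m.size(0), 1, device=m.device) > self.p_drop_meta).float()
            a, m = a * keep_a, m * keep_m
        return self.classifier(torch.cat([a, m], dim=1))  # logits over languages


# Usage: a batch of 4 spectrograms (96 mel bins, 256 frames) and metadata vectors.
model = MultimodalLanguageClassifier()
logits = model(torch.randn(4, 1, 96, 256), torch.softmax(torch.randn(4, 10), dim=1))
print(logits.shape)  # torch.Size([4, 10])
```

At inference time, a missing modality can be represented by an all-zero input to the corresponding branch, which is the situation the dropout scheme exposes the model to during training.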
