Paper information - Singing Voice Detection Using Multi-Feature Deep Fusion with CNN

Singing Voice Detection Using Multi-Feature Deep Fusion with CNN

The problem of singing voice detection is to segment a song into vocal and non-vocal parts. Common methods train a model on a set of frame-based features and then use it to classify unseen frames. However, the multi-dimensional features are usually just concatenated for each frame, with little consideration of spatial information. Hence, a deep fusion method over the multi-feature dimensions using Convolutional Neural Networks (CNN) is proposed. A one-dimensional convolution is applied along the feature dimensions of each frame, and the resulting high-level features are used directly for binary classification. The performance of the proposed method is on par with state-of-the-art methods on public data.
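To make the per-frame operation concrete, the following is a minimal NumPy sketch of a one-dimensional convolution along the feature axis followed by a binary decision. It is an illustration under stated assumptions, not the authors' architecture: the abstract does not give the number of kernels, kernel width, activation, pooling, or classifier, so the ReLU, global max pooling, logistic output, all sizes, and the names `conv1d_over_features` and `predict_vocal` are hypothetical. The weights are random rather than trained.

```python
import numpy as np


def conv1d_over_features(frames, kernels):
    """Valid 1-D convolution along the feature axis, one frame at a time.

    frames:  (n_frames, n_features) concatenated feature vector per frame
    kernels: (n_kernels, kernel_width)
    returns: (n_frames, n_kernels, n_features - kernel_width + 1)

    As in most CNN code, this is cross-correlation (the kernel is not flipped).
    """
    width = kernels.shape[1]
    # windows[t, p, :] is the slice of frame t starting at feature index p
    windows = np.lib.stride_tricks.sliding_window_view(frames, width, axis=1)
    return np.einsum("tpw,kw->tkp", windows, kernels)


def predict_vocal(frames, kernels, weights, bias):
    """Return a vocal probability per frame (all layer choices are assumptions)."""
    feature_maps = np.maximum(conv1d_over_features(frames, kernels), 0.0)  # ReLU
    pooled = feature_maps.max(axis=2)           # global max pool: (n_frames, n_kernels)
    logits = pooled @ weights + bias            # logistic output layer
    return 1.0 / (1.0 + np.exp(-logits))


# Toy input: 5 frames, 40 feature dimensions (hypothetical sizes, random values)
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 40))
kernels = rng.normal(size=(8, 3)) * 0.1         # 8 untrained kernels of width 3
weights = rng.normal(size=8)

probs = predict_vocal(frames, kernels, weights, bias=0.0)
is_vocal = probs >= 0.5                         # frame-wise vocal / non-vocal decision
```

Because the kernels slide over neighbouring feature dimensions, the ordering of features inside the concatenated vector matters here, which is one way to read the abstract's point about spatial information.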
