Singing Voice Detection Using Multi-Feature Deep Fusion with CNN

Singing voice detection is the task of segmenting a song into vocal and non-vocal parts. Common methods train a model on a set of frame-based features and then use it to classify unseen frames. However, the multi-dimensional features of each frame are usually just concatenated, with little consideration of the spatial relationships among feature dimensions. We therefore propose a deep fusion method that combines the multi-feature dimensions with a Convolutional Neural Network (CNN). A one-dimensional convolution is applied along the feature dimension of each frame, and the resulting high-level features are used directly for binary classification. The proposed method performs on par with state-of-the-art methods on public data.
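The per-frame operation described above can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the feature size (40), the number of filters (8), the kernel width (3), the ReLU, the max pooling over positions, and the logistic output layer are all assumptions chosen for illustration, and the weights are random rather than trained. The function names are hypothetical.

```python
import numpy as np


def conv1d_over_features(frames, kernels, bias):
    """Valid 1-D convolution (cross-correlation) along the feature axis.

    frames:  (n_frames, n_features) one feature vector per audio frame
    kernels: (n_filters, kernel_size)
    bias:    (n_filters,)
    returns: (n_frames, n_filters, n_features - kernel_size + 1)
    """
    kernel_size = kernels.shape[1]
    n_positions = frames.shape[1] - kernel_size + 1
    # Slide each kernel over neighbouring feature dimensions of every frame.
    out = np.stack(
        [frames[:, p:p + kernel_size] @ kernels.T for p in range(n_positions)],
        axis=2,
    )
    out = out + bias[None, :, None]
    return np.maximum(out, 0.0)  # ReLU (assumed non-linearity)


def classify_frames(frames, kernels, bias, w, b):
    """Return a vocal probability for each frame."""
    feats = conv1d_over_features(frames, kernels, bias)
    pooled = feats.max(axis=2)          # max over positions (assumed pooling)
    logits = pooled @ w + b             # linear layer on the pooled features
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid for binary classification


# Toy data: 5 frames with 40 feature dimensions each (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 40))
kernels = rng.standard_normal((8, 3)) * 0.1  # untrained, random weights
bias = np.zeros(8)
w = rng.standard_normal(8) * 0.1
b = 0.0

probs = classify_frames(frames, kernels, bias, w, b)
is_vocal = probs > 0.5  # per-frame vocal / non-vocal decision
```

Because the kernel slides across adjacent feature dimensions rather than across time, the order in which features are stacked within a frame affects what the filters can learn, which is the spatial information that plain concatenation ignores.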
