Learning vocal mode classifiers from heterogeneous data sources

This paper targets a generalized vocal mode classifier (speech vs. singing) that works on audio from an arbitrary data source. Previous studies on sound classification are commonly based on cross-validation within a single dataset, without considering the mismatch between training and recognition conditions. In our study, two experimental setups are used: a matched training-recognition condition and a mismatched one. In the matched setup, classification performance is evaluated by cross-validation on TUT-vocal-2016. In the mismatched setup, seven other datasets are used for training and TUT-vocal-2016 for testing. The experimental results show that classification accuracy is much lower in the mismatched condition (69.6%) than in the matched condition (95.5%). Various feature normalization methods were tested to improve performance under the mismatched condition. The best accuracy (96.8%) was obtained with the proposed subdataset-wise normalization.
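To illustrate the idea behind subdataset-wise normalization, the sketch below standardizes each feature dimension independently within each source dataset, so that features from heterogeneous recording conditions share a common scale before classification. This is a minimal illustration, assuming the normalization amounts to per-subdataset mean-variance standardization; the function name, array layout, and epsilon guard are illustrative, not taken from the paper.

```python
import numpy as np

def subdataset_wise_normalize(features, dataset_ids):
    """Standardize each feature dimension to zero mean and unit variance
    separately within each source dataset (a hypothetical sketch of
    subdataset-wise normalization).

    features    : (n_frames, n_dims) array of frame-level features
    dataset_ids : length-n_frames array labelling each frame's source dataset
    """
    normalized = np.empty_like(features, dtype=float)
    for d in np.unique(dataset_ids):
        mask = dataset_ids == d
        mu = features[mask].mean(axis=0)
        sigma = features[mask].std(axis=0)
        # Guard against zero variance in any feature dimension.
        normalized[mask] = (features[mask] - mu) / np.maximum(sigma, 1e-8)
    return normalized
```

After this step, a classifier trained on the pooled, normalized training sets no longer sees dataset-specific offsets in feature scale, which is the kind of training-recognition mismatch the abstract describes.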
