Semi-Supervised Audio Classification with Consistency-Based Regularization

Consistency-based semi-supervised learning methods such as the Mean Teacher method are state-of-the-art on image datasets, but have yet to be applied to audio data. Such methods encourage model predictions to be consistent under perturbations of the input. In this paper, we incorporate audio-specific perturbations into the Mean Teacher algorithm and demonstrate the effectiveness of the resulting method on audio classification tasks. Specifically, we perturb audio inputs by mixing in other environmental audio clips, leveraging other training examples as sources of noise. Experiments on the Google Speech Commands Dataset and the UrbanSound8K Dataset show that the method can match the performance of a purely supervised approach while using only a fraction of the labels.
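The core ingredients described above — an audio-mixing perturbation, a teacher whose weights track an exponential moving average (EMA) of the student's, and a consistency loss between the two models' predictions — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the mixing coefficient `alpha`, and the EMA decay value are all illustrative assumptions.

```python
import numpy as np

def mix_perturb(x, noise_pool, rng, alpha=0.3):
    """Perturb a waveform by mixing in a randomly chosen clip from a pool
    (e.g. another training example used as a noise source).
    alpha bounds the mixing weight; its value here is an assumption."""
    noise = noise_pool[rng.integers(len(noise_pool))]
    lam = rng.uniform(0.0, alpha)
    return (1.0 - lam) * x + lam * noise

def ema_update(teacher_w, student_w, decay=0.99):
    """Mean Teacher update: teacher weights are an exponential moving
    average of the student weights after each training step."""
    return decay * teacher_w + (1.0 - decay) * student_w

def consistency_loss(student_pred, teacher_pred):
    """Consistency cost: mean squared error between the student's and the
    teacher's predictions on (differently) perturbed versions of an input.
    This term needs no labels, so it applies to unlabeled audio too."""
    return float(np.mean((student_pred - teacher_pred) ** 2))
```

In a full training loop, the labeled examples would additionally contribute a standard classification loss, and the total objective would be the classification loss plus a weighted consistency loss computed over both labeled and unlabeled clips.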