Generalizing AUC Optimization to Multiclass Classification for Audio Segmentation With Limited Training Data

Area under the ROC curve (AUC) optimisation techniques developed for neural networks have recently demonstrated their capabilities in different audio- and speech-related tasks. However, because the AUC is defined for a binary decision between a target and a non-target class, AUC optimisation has so far been applied only to binary tasks. In this paper, we introduce an extension to the AUC optimisation framework so that it can be applied to an arbitrary number of classes, aiming to overcome the issues derived from training data limitations in deep learning solutions. Building upon the multiclass definitions of the AUC metric found in the literature, we define two new training objectives using a one-versus-one and a one-versus-rest approach. To demonstrate their potential, we apply them in an audio segmentation task with limited training data that aims to differentiate three classes: foreground music, background music and no music. Experimental results show that our proposal can improve the performance of audio segmentation systems significantly compared to traditional training criteria such as cross entropy.
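The one-versus-one and one-versus-rest objectives mentioned above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a sigmoid relaxation of the pairwise AUC statistic, a Hand-and-Till-style average over ordered class pairs for the one-versus-one case, and that every class appears at least once in the batch. The function names and the `delta` sharpness parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def soft_auc(pos_scores, neg_scores, delta=10.0):
    """Smooth AUC surrogate: mean sigmoid over all positive-minus-negative score gaps.

    Assumption: a sigmoid with sharpness `delta` replaces the non-differentiable
    step function in the pairwise (Wilcoxon-Mann-Whitney) form of the AUC.
    """
    diffs = pos_scores[:, None] - neg_scores[None, :]  # all positive/negative pairs
    return (1.0 / (1.0 + np.exp(-delta * diffs))).mean()

def ovr_auc_loss(scores, labels, delta=10.0):
    """One-versus-rest: class c is the target, every other class is non-target."""
    n_classes = scores.shape[1]
    aucs = [
        soft_auc(scores[labels == c, c], scores[labels != c, c], delta)
        for c in range(n_classes)
    ]
    return 1.0 - np.mean(aucs)

def ovo_auc_loss(scores, labels, delta=10.0):
    """One-versus-one: average the surrogate over all ordered class pairs (i, j)."""
    n_classes = scores.shape[1]
    aucs = [
        # Score for class i on class-i examples versus on class-j examples.
        soft_auc(scores[labels == i, i], scores[labels == j, i], delta)
        for i in range(n_classes)
        for j in range(n_classes)
        if i != j
    ]
    return 1.0 - np.mean(aucs)

# Toy batch with three classes, e.g. foreground music, background music, no music.
labels = np.array([0, 0, 1, 1, 2, 2])
separable_scores = np.eye(3)[labels]        # each example scores 1 on its own class
uninformative_scores = np.full((6, 3), 0.5)  # every class scored identically

loss_ovr_good = ovr_auc_loss(separable_scores, labels)
loss_ovo_good = ovo_auc_loss(separable_scores, labels)
loss_ovr_flat = ovr_auc_loss(uninformative_scores, labels)
loss_ovo_flat = ovo_auc_loss(uninformative_scores, labels)
```

In this toy batch, well-separated scores give a loss close to 0 for both variants, while scores that do not distinguish the classes give a loss of 0.5, matching the chance-level AUC. In a training setting the same expressions would be written in an automatic-differentiation framework so that gradients flow through the sigmoid.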
