Multi-instance Learning for Bipolar Disorder Diagnosis using Weakly Labelled Speech Data

While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-related recognition problems are often small, both in the total amount of data and in the number of individuals represented. The Bipolar Disorder corpus, used in the 2018 Audio/Visual Emotion Challenge, contains only 218 audio samples from 46 individuals. Herein, we present a multi-instance learning framework aimed at constructing more reliable deep learning-based models under such conditions. First, we segment each speech file into multiple chunks. These chunks are only weakly labelled: each is annotated with the label of the corresponding speech file, yet may not be indicative of that label. We then train a deep learning-based (ensemble) multi-instance learning model designed to handle this weakly labelled setting. The presented results demonstrate that this approach can improve the accuracy of feedforward, recurrent, and convolutional neural networks on the 3-class mania classification task on the Bipolar Disorder corpus.
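As a rough illustration of the chunk-and-pool idea described above, the following Python/PyTorch sketch segments a waveform into fixed-length chunks, assigns each chunk the recording-level (bag) label, and pools chunk-level predictions into a single bag-level posterior. The chunk length, the stand-in per-chunk features, and the mean-pooling aggregation are illustrative assumptions; they are not the exact features, network architectures, or ensemble used in the paper.

```python
# Minimal sketch of chunk-then-pool multi-instance learning (MIL) for a
# weakly labelled recording; all hyper-parameters here are assumptions.
import torch
import torch.nn as nn

def segment_waveform(wave: torch.Tensor, chunk_len: int, hop: int) -> torch.Tensor:
    """Split a 1-D mono waveform into overlapping fixed-length chunks.
    Every chunk inherits the recording-level label, i.e. a weak label."""
    return wave.unfold(0, chunk_len, hop)          # (n_chunks, chunk_len)

class MILClassifier(nn.Module):
    """Instance-level feed-forward net; chunk posteriors are mean-pooled
    into one bag-level prediction (3 mania classes)."""
    def __init__(self, n_features: int, n_classes: int = 3):
        super().__init__()
        self.instance_net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (n_chunks, n_features) -> per-chunk logits -> pooled bag logits
        chunk_logits = self.instance_net(bag)
        return chunk_logits.mean(dim=0)            # (n_classes,)

if __name__ == "__main__":
    torch.manual_seed(0)
    wave = torch.randn(16000 * 10)                 # 10 s of synthetic 16 kHz audio
    chunks = segment_waveform(wave, chunk_len=16000 * 2, hop=16000)
    # Toy per-chunk features (mean and std); the paper uses acoustic/learned features.
    feats = torch.stack([chunks.mean(dim=1), chunks.std(dim=1)], dim=1)
    model = MILClassifier(n_features=2)
    bag_logits = model(feats)
    # Only the bag label is available; the chunks are never labelled individually.
    loss = nn.CrossEntropyLoss()(bag_logits.unsqueeze(0), torch.tensor([1]))
    loss.backward()
    print(bag_logits.shape, float(loss))
```

Mean pooling is the simplest MIL aggregation; attention-based or max pooling, or an ensemble of instance learners, can be swapped in at the pooling step without changing the weakly labelled training setup.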
