Monitoring Infant's Emotional Cry in Domestic Environments Using the Capsule Network Architecture

Automated recognition of an infant's cry from audio can be considered a preliminary step for applications such as remote baby monitoring. In this paper, we apply a recently introduced deep learning topology, the capsule network (CapsNet), to the cry recognition problem. A capsule is a group of neurons whose activity vector represents the instantiation parameters of an entity, with the length of the vector encoding the probability that the entity exists. Active capsules at one level make predictions, via transformation matrices, for the parameters of higher-level capsules; when multiple predictions agree, a higher-level capsule becomes active. We employ spectrogram representations of short audio segments as input to the CapsNet. For experimental evaluation, we apply the proposed method to the Crying sub-challenge of the INTERSPEECH 2018 Computational Paralinguistics Challenge (ComParE), a three-class classification task on the annotated CRIED database. The provided audio samples contain recordings of 20 healthy infants and are categorized into three classes: neutral, fussing, and crying. We show that the multi-layer CapsNet is competitive with the baseline performance on the CRIED corpus and considerably outperforms a conventional convolutional neural network.
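
As a rough illustration of the routing-by-agreement mechanism described above, the following NumPy sketch implements the squashing non-linearity and dynamic routing introduced in the capsule network formulation of Sabour, Frosst, and Hinton. The capsule counts, vector dimensions, and random transformation matrices are toy assumptions for demonstration, not the configuration used in this work.

    import numpy as np

    def squash(s, axis=-1, eps=1e-8):
        # Non-linearity that shrinks a vector so its length lies in [0, 1),
        # letting the length act as an existence probability.
        sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
        return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

    def dynamic_routing(u_hat, num_iterations=3):
        # u_hat[i, j] = W_ij @ u_i: prediction of lower capsule i for upper capsule j.
        # Shape: (num_lower, num_upper, dim_upper).
        num_lower, num_upper, _ = u_hat.shape
        b = np.zeros((num_lower, num_upper))                      # routing logits
        for _ in range(num_iterations):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
            s = np.einsum('ij,ijk->jk', c, u_hat)                 # weighted sum of predictions
            v = squash(s)                                         # upper-level activity vectors
            b += np.einsum('ijk,jk->ij', u_hat, v)                # agreement strengthens routes
        return v

    # Toy usage: 6 lower-level capsules route to 3 upper-level capsules.
    rng = np.random.default_rng(0)
    u = rng.normal(size=(6, 8))                                   # lower-level activity vectors
    W = 0.1 * rng.normal(size=(6, 3, 8, 16))                      # transformation matrices W_ij
    u_hat = np.einsum('ijdk,id->ijk', W, u)                       # predictions for upper capsules
    v = dynamic_routing(u_hat)
    print(np.linalg.norm(v, axis=-1))                             # vector lengths in [0, 1)

Capsules whose incoming predictions agree receive larger coupling coefficients over the routing iterations, so their output vectors grow in length; this is the sense in which agreement activates a higher-level capsule.

The spectrogram front-end can likewise be sketched with librosa; the file name, sampling rate, segment length, and STFT parameters below are illustrative assumptions rather than the exact settings used for the CRIED recordings.

    import numpy as np
    import librosa

    # Hypothetical input file and parameters, for illustration only.
    y, sr = librosa.load('infant_recording.wav', sr=16000, mono=True)
    segment = y[:sr]                                    # one 1-second segment
    S = np.abs(librosa.stft(segment, n_fft=512, hop_length=256))
    log_S = librosa.amplitude_to_db(S, ref=np.max)      # log-magnitude spectrogram
    x = log_S[np.newaxis, :, :, np.newaxis]             # (batch, freq, time, channel) tensor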
