Visual Object Detector for Cow Sound Event Detection

Sound event detection (SED) is a reasonable choice in application domains such as cattle sheds, dense forests, and other dark environments where visual objects are usually concealed or invisible. This study presents an autonomous monitoring system based on sound characteristics, developed for welfare management on large cattle farms. Two types of artificial audio datasets are prepared, the cow sound event dataset and the UrbanSound8K dataset, and are then used with various sound object detectors for real-world implementation. Using a data-driven approach, we first apply a conventional convolutional neural network (CNN) structure with certain improvements, and then proceed to a two-stage visual object detection method for audio that treats acoustic signals as RGB images. The object detection method achieves a higher quantitative evaluation score and more precise qualitative results than previous related studies. We conclude that visual object detection methods are more effective than currently available CNN architectures for rare sound object detection. Moreover, an artificial data preparation strategy can provide a better way of addressing the data scarcity and annotation difficulties involved in rare sound event detection.
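The abstract does not describe how the artificial datasets are built, so the following is only a minimal sketch of one common way to prepare such data, not the authors' procedure. It assumes a rare event clip is overlaid on a background recording at a random position, which yields the onset and offset annotation without manual labelling. The function name `mix_event`, the 16 kHz sample rate, and the toy signals are all hypothetical.

```python
import random


def mix_event(background, event, rng, sample_rate=16000, gain=1.0):
    """Overlay `event` on a copy of `background` at a random position.

    Hypothetical helper: returns the mixture and an (onset, offset)
    annotation in seconds, derived from the chosen start sample.
    """
    if len(event) > len(background):
        raise ValueError("event must not be longer than background")
    # Pick a start index such that the whole event fits in the background.
    start = rng.randrange(len(background) - len(event) + 1)
    mixture = list(background)  # copy, so the input is left unchanged
    for i, sample in enumerate(event):
        mixture[start + i] += gain * sample
    onset = start / sample_rate
    offset = (start + len(event)) / sample_rate
    return mixture, (onset, offset)


# Toy signals: 1 s of silence as background, 0.25 s of constant amplitude as event.
rng = random.Random(0)
background = [0.0] * 16000
event = [0.5] * 4000
mixture, (onset, offset) = mix_event(background, event, rng)
```

Because the insertion point is chosen by the script, every synthetic clip comes with an exact time annotation, which is the property that makes this kind of preparation attractive when real recordings of rare events are scarce and hard to label.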
