Bird Identification from Timestamped, Geotagged Audio Recordings

Large-scale biodiversity monitoring would benefit from automated solutions that support or complement human experts and citizen scientists. To foster their development, the yearly BirdCLEF scientific challenge compares approaches for identifying bird species in recorded vocalizations. The solution described in this work is based on an ensemble of Convolutional Neural Networks (CNNs) processing a mel spectrogram, combined with Multi-Layer Perceptrons (MLPs) processing the recording date, time and geographic location. In BirdCLEF 2018, it achieved a mean average precision of 0.705 in detecting 1,500 South American bird species (0.785 for the foreground species), which made it the second-best entry in the challenge.
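The abstract does not say how the recording date, time and location are presented to the MLPs, or how the ensemble members are combined. The following is only a minimal sketch under assumptions that are not taken from the paper: date and time are encoded cyclically with sine and cosine, coordinates are scaled to [-1, 1], and per-species probabilities are averaged across models. The function names `encode_metadata` and `average_predictions` are hypothetical.

```python
import math
from datetime import datetime


def encode_metadata(when, lat, lon):
    """Turn date, time and location into six bounded features.

    Assumption (not from the paper): cyclic sine/cosine encoding, so that
    31 December sits next to 1 January and 23:59 sits next to 00:00.
    """
    # Day of year mapped to an angle; leap years slightly exceed one turn.
    day_angle = 2 * math.pi * (when.timetuple().tm_yday - 1) / 365.0
    # Minute of day mapped to an angle over a 24-hour cycle.
    time_angle = 2 * math.pi * (when.hour * 60 + when.minute) / 1440.0
    return [
        math.sin(day_angle), math.cos(day_angle),
        math.sin(time_angle), math.cos(time_angle),
        lat / 90.0,   # latitude scaled to [-1, 1]
        lon / 180.0,  # longitude scaled to [-1, 1]
    ]


def average_predictions(per_model_probs):
    """Average per-species probabilities over all ensemble members.

    Assumption (not from the paper): a plain unweighted mean.
    """
    n_models = len(per_model_probs)
    n_species = len(per_model_probs[0])
    return [
        sum(probs[s] for probs in per_model_probs) / n_models
        for s in range(n_species)
    ]


# Hypothetical recording: 1 January 2018, midnight, somewhere in Amazonia.
features = encode_metadata(datetime(2018, 1, 1, 0, 0), lat=-3.0, lon=-60.0)

# Hypothetical outputs of two models (e.g. one CNN, one MLP) for two species.
combined = average_predictions([[0.2, 0.8], [0.4, 0.6]])
```

Here `features` would be the input to a metadata MLP, while `combined` stands in for the final ensemble score per species; a weighted or learned combination would be an equally plausible choice.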
