Audio Bird Classification with Inception-v4 extended with Time and Time-Frequency Attention Mechanisms

We present an adaptation of the deep convolutional network Inception-v4 tailored to bioacoustic classification problems. Bird sound classification was treated as an image classification problem through transfer learning from Inception. Inception, the state of the art in image classification, was applied together with an attention mechanism to (multiscale) time-frequency representations, or images, of bird sounds. The result is an efficient pipeline that we call Soundception. Soundception scored highest on all tasks in the BirdClef2017 challenge, reaching a Mean Average Precision of 0.714 on the task that required classifying 1500 bird species. To our knowledge, Soundception is currently the most effective model for biodiversity monitoring of complex soundscapes.
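
To make the idea of treating sound as an image concrete, the following is a minimal sketch, not the Soundception implementation. It converts a waveform into a log-magnitude time-frequency image with a short-time Fourier transform and then pools that image over time with softmax weights, a generic form of temporal attention. The function names, the sample rate, the window and hop sizes, and the use of frame energy as a stand-in for learned attention scores are all assumptions made for illustration; none of them is taken from the abstract.

```python
import numpy as np


def log_spectrogram(signal, n_fft=512, hop=128):
    """Log-magnitude STFT; rows are frequency bins, columns are time frames.

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    return np.log1p(magnitude).T                     # (n_fft // 2 + 1, n_frames)


def time_attention_pool(features, scores):
    """Softmax-weighted average of features over the time axis.

    features: (dim, n_frames); scores: (n_frames,), one score per time frame.
    """
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return features @ weights                # (dim,)


# Synthetic one-second "recording": a 3 kHz tone at an assumed 22050 Hz sample rate.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 3000 * t)

spec = log_spectrogram(signal)

# Assumption: per-frame energy stands in for scores a trained network would produce.
scores = spec.sum(axis=0)
pooled = time_attention_pool(spec, scores)
```

Calling `log_spectrogram` with several `n_fft` values would give one image per time-frequency resolution, which is one plausible reading of "multiscale" here. In a transfer-learning setup, such images would be fed to a pretrained image network in place of photographs.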