Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning

Automatic species classification of birds from their sound is a computational tool of increasing importance in ecology, conservation monitoring and vocal communication studies. To make classification useful in practice, it is crucial to improve its accuracy while ensuring that it can run at big data scales. Many approaches use acoustic measures based on spectrogram-type data, such as the Mel-frequency cepstral coefficient (MFCC) features, which represent a manually designed summary of spectral information. However, recent work in machine learning has demonstrated that features learnt automatically from data can often outperform manually designed feature transforms. Feature learning can be performed at large scale and “unsupervised”, meaning it requires no manual data labelling, yet it can improve performance on “supervised” tasks such as classification. In this work we introduce a technique for feature learning from large volumes of bird sound recordings, inspired by techniques that have proven useful in other domains. We experimentally compare twelve different feature representations derived from the Mel spectrum (of which six use this technique), using four large and diverse databases of bird vocalisations, classified using a random forest classifier. We demonstrate that in our classification tasks, MFCCs can often lead to worse performance than the raw Mel spectral data from which they are derived. Conversely, we demonstrate that unsupervised feature learning provides a substantial boost over MFCCs and Mel spectra without adding computational complexity after the model has been trained. The boost is particularly notable for single-label classification tasks at large scale. However, for one of our datasets, which contains substantial audio data but few annotations, increased performance is not discernible. The spectro-temporal activations learned through our procedure resemble spectro-temporal receptive fields calculated from avian primary auditory forebrain. We study the interaction between dataset characteristics and choice of feature representation through further empirical analysis.
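
The general idea of learning features from unlabelled Mel spectra and then summarising each recording can be sketched as follows. This is a minimal illustration only, not the procedure used in this work: the abstract does not name the learning algorithm, so a spherical k-means-style clustering is assumed here as a stand-in, and the function names (`learn_bases`, `encode`), the number of bases, the 40-band frames and the synthetic data are all hypothetical.

```python
import numpy as np

def learn_bases(frames, k, n_iter=20, seed=0):
    """Assumed stand-in learner: spherical k-means-style clustering.

    frames: (n_frames, n_bands) array of Mel spectral frames, no labels needed.
    Returns k unit-norm basis vectors of shape (k, n_bands).
    """
    rng = np.random.default_rng(seed)
    # Normalise each frame to unit length so clustering uses direction only.
    x = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-9)
    bases = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each frame to the basis with the highest cosine similarity.
        assign = np.argmax(x @ bases.T, axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # empty clusters keep their previous basis
                c = members.sum(axis=0)
                bases[j] = c / (np.linalg.norm(c) + 1e-9)
    return bases

def encode(frames, bases):
    """Project frames onto the learned bases, then summarise over time."""
    acts = frames @ bases.T  # (n_frames, k) activations
    return np.concatenate([acts.mean(axis=0), acts.std(axis=0)])

# Synthetic stand-ins for real Mel spectrograms (hypothetical sizes).
rng = np.random.default_rng(1)
unlabelled = rng.random((500, 40))   # pooled frames from many recordings
bases = learn_bases(unlabelled, k=8)
clip = rng.random((120, 40))         # frames from one recording
features = encode(clip, bases)       # fixed-length vector for a classifier
```

Once the bases are learned, encoding a new recording is a single matrix product followed by summary statistics. The resulting fixed-length vectors could then be passed to a supervised classifier such as the random forest mentioned above.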
