Automated Segmentation of Folk Songs Using Artificial Neural Networks

Two systems are introduced that perform automated audio annotation and segmentation of Cypriot folk songs into meaningful musical information. The first system consists of three artificial neural networks (ANNs) that use low-level timbre features. Together, the outputs of the three networks classify an unknown song as “monophonic” or “polyphonic”. The second system employs a single ANN with the same feature set: it takes a polyphonic song as input and identifies the boundaries between its instrumental and vocal parts. For the “monophonic/polyphonic” classification, a precision of 0.88 and a recall of 0.78 were achieved. For the “vocal/instrumental” classification, a precision of 0.85 and a recall of 0.83 were achieved. The results indicate that the low-level timbre features were able to capture the characteristics of the audio signals, and that the chosen ANN structures were suitable for these classification problems and outperformed classical statistical methods.
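The abstract does not say how the second system turns its network's decisions into part boundaries. The sketch below is one plausible approach, not the authors' method. It assumes the ANN emits one vocal/instrumental decision per analysis frame at a fixed hop size. The hop size, the window length, the majority smoothing, and the helper names `majority_smooth` and `to_segments` are all hypothetical choices made for illustration. Isolated misclassified frames are smoothed away, and runs of identical labels then become timed segments whose start times are the boundaries.

```python
from itertools import groupby


def majority_smooth(frame_labels, window=3):
    """Majority-filter 0/1 frame decisions (1 = vocal, 0 = instrumental).

    Hypothetical post-processing step. For binary labels and an odd window
    this matches a median filter; near the edges the window is truncated
    and ties resolve to 0 (instrumental).
    """
    half = window // 2
    n = len(frame_labels)
    smoothed = []
    for i in range(n):
        win = frame_labels[max(0, i - half):min(n, i + half + 1)]
        smoothed.append(1 if 2 * sum(win) > len(win) else 0)
    return smoothed


def to_segments(frame_labels, hop_seconds):
    """Group runs of identical labels into (start_s, end_s, label) segments."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        end = start + length
        segments.append((start * hop_seconds, end * hop_seconds,
                         "vocal" if label else "instrumental"))
        start = end
    return segments


# Hypothetical per-frame ANN decisions at an assumed 0.5 s hop. The isolated
# vocal frame at index 4 and instrumental frame at index 11 act as noise.
raw = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
segments = to_segments(majority_smooth(raw), hop_seconds=0.5)
boundaries = [start for start, _, _ in segments[1:]]
print(segments)    # [(0.0, 3.5, 'instrumental'), (3.5, 7.5, 'vocal'), (7.5, 10.0, 'instrumental')]
print(boundaries)  # [3.5, 7.5]
```

A wider smoothing window suppresses more spurious switches, but it also shifts or merges short genuine parts. The right trade-off would depend on the frame rate and the typical length of vocal and instrumental passages.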

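The abstract reports precision and recall without defining them. It also does not state which class counts as positive. For reference, the standard definitions are shown below in terms of true positives (TP), false positives (FP), and false negatives (FN). The F1 score is their harmonic mean. It is not reported in the abstract and is computed here only from the stated precision and recall values, as a single-number summary of each pair.

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}

F_1^{\text{mono/poly}} = \frac{2(0.88)(0.78)}{0.88 + 0.78} = \frac{1.3728}{1.66} \approx 0.83,
\qquad
F_1^{\text{vocal/instr}} = \frac{2(0.85)(0.83)}{0.85 + 0.83} = \frac{1.411}{1.68} \approx 0.84
```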