A GMM Approach to Singing Language Identification

Automatic language identification for singing is a topic that has not received much attention for the past years. Possible application scenarios include searching for musical pieces in a certain language, improvement of similarity search algorithms for music, and improvement of regional music classification and genre classification. It could also serve to mitigate the “glass ceiling” effect. Most existing approaches employ PPRLM (Parallel Phone Recognition followed by Language Modelling) processing. Recent publications show that GMM-based (Gaussian Mixture Models) approaches are now able to produce results comparable to PPRLM systems when using certain audio features. Their advantages lie in their simplicity of implementation and the reduced training data requirements. This was only tested on speech data so far. In this paper, we therefore try out such a GMM-based approach for singing language identification. We test our system on speech data and a-capella singing. We use MFCC (Mel-Frequency Cepstral Coefficients), TRAP (Termporal Pattern), and SDC (Shifted Delta Cepstrum) features. The results are comparable to the state of the art for singing language identification, but the approach is a lot simpler to implement as no phoneme-wise annotations are required. We obtain results of 75% accuracy for speech data and 67.5% accuracy for a-capella data. To our knowledge, neither the GMM-based approach nor this feature combination have been used for the purpose of singing language identification before.

[1]  Jakob Abeßer,et al.  Automatic Classification of Musical Pieces into Global Cultural Areas , 2011, Semantic Audio.

[2]  Adam Dustor,et al.  Spoken language identification based on GMM models , 2010, ICSES 2010 International Conference on Signals and Electronic Circuits.

[3]  Y.K. Muthusamy,et al.  Reviewing automatic language identification , 1994, IEEE Signal Processing Magazine.

[4]  Daniel Willett,et al.  Language Identification in Vocal Music , 2006, ISMIR.

[5]  Marc A. Zissman,et al.  Comparison of : Four Approaches to Automatic Language Identification of Telephone Speech , 2004 .

[6]  David A. Ross,et al.  Automatic Language Identification in music videos with low level audio and visual features , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Douglas A. Reynolds,et al.  Approaches to language identification using Gaussian mixture models and shifted delta cepstral features , 2002, INTERSPEECH.

[8]  François Pachet,et al.  Improving Timbre Similarity : How high’s the sky ? , 2004 .

[9]  Eliathamby Ambikairajah,et al.  Language Identification using Warping and the Shifted Delta Cepstrum , 2005, 2005 IEEE 7th Workshop on Multimedia Signal Processing.

[10]  Hynek Hermansky,et al.  Temporal patterns (TRAPs) in ASR of noisy speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[11]  Hsiao-Chuan Wang,et al.  Language identification using pitch contour information , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[12]  William M. Campbell,et al.  Acoustic, phonetic, and discriminative approaches to automatic language identification , 2003, INTERSPEECH.

[13]  Hynek Hermansky,et al.  TRAPS - classifiers of temporal patterns , 1998, ICSLP.

[14]  William M. Campbell,et al.  Support vector machines for speaker and language recognition , 2006, Comput. Speech Lang..

[15]  Jens Kofod Hansen,et al.  Recognition Of Phonemes In A-Cappella Recordings Using Temporal Patterns And Mel Frequency Cepstral Coefficients , 2012 .

[16]  John H. L. Hansen,et al.  Language identification for singing , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Hsin-Min Wang,et al.  Towards Automatic Identification Of Singing Language In Popular Music Recordings , 2004, ISMIR.

[18]  M. A. Kohler,et al.  Language identification using shifted delta cepstra , 2002, The 2002 45th Midwest Symposium on Circuits and Systems, 2002. MWSCAS-2002..

[19]  Yonghong Yan,et al.  Experiments for an approach to language identification with conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[20]  Lukás Burget,et al.  Language Recognition in iVectors Space , 2011, INTERSPEECH.

[21]  Pavel Matejka,et al.  Automatic Language Identification Using Phoneme and Automatically Derived Unit Strings , 2004, TSD.