Identification of Indian languages using multi-level spectral and prosodic features

This paper explores spectral and prosodic features extracted at different levels for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), individual pitch cycles (pitch synchronous analysis), and glottal closure regions are used to discriminate between languages. In addition to the spectral features, prosodic features extracted at the syllable, tri-syllable, and multi-word (phrase) levels are proposed for capturing language-specific information. Language-specific prosody is represented by intonation, rhythm, and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (F0 contour), syllable durations, and temporal variations in intensity (energy contour) represent prosody at the multi-word (phrase) level. The language-specific information in the proposed features is analyzed using the Indian language speech database IITKGP-MLILSC. Gaussian mixture models are used to capture the language-specific information from the proposed features. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.
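
As a rough illustration of the pipeline outlined in the abstract, the sketch below frames a signal into 20 ms blocks and picks the language whose Gaussian mixture model gives the highest average log-likelihood over a set of feature vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 10 ms frame shift, the diagonal covariances, and the toy one-dimensional models are all hypothetical, and the feature extraction step itself (spectral or prosodic) is left out.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=20.0, shift_ms=10.0):
    """Split a sequence of samples into fixed-length frames (block processing).

    The 20 ms frame length follows the abstract; the 10 ms shift is an
    assumption made for this sketch.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    `gmm` is a list of (weight, means, variances) tuples, one per component.
    This representation is hypothetical and chosen only for readability.
    """
    log_terms = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def identify_language(features, models):
    """Return the best-scoring language and the per-language average scores."""
    scores = {}
    for language, gmm in models.items():
        total = sum(gmm_log_likelihood(x, gmm) for x in features)
        scores[language] = total / len(features)
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage: two made-up one-dimensional models, not trained on any database.
toy_models = {
    "language_a": [(1.0, [0.0], [1.0])],
    "language_b": [(0.5, [4.0], [1.0]), (0.5, [6.0], [1.0])],
}
toy_features = [[4.8], [5.1], [5.6]]
best_language, toy_scores = identify_language(toy_features, toy_models)
```

In practice each language model would be trained on feature vectors from that language, and scores from models built on different feature types could be combined before the decision; how the scores are combined is not specified in the abstract.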
