Rhythmic unit extraction and modelling for automatic language identification

This paper presents an approach to Automatic Language Identification based on rhythmic modelling. Besides phonetics and phonotactics, rhythm is one of the most promising features to consider for language identification, although extracting and modelling it is not straightforward. Indeed, one of the main problems to address is deciding what to model. In this paper, a rhythm extraction algorithm is described: using a vowel detection algorithm, rhythmic units related to syllables are segmented. Several parameters are extracted (consonantal and vowel duration, cluster complexity) and modelled with a Gaussian mixture. Experiments are performed on read speech for 7 languages (English, French, German, Italian, Japanese, Mandarin and Spanish). Results reach up to 86 ± 6% correct discrimination between stress-timed, mora-timed and syllable-timed classes of languages, and 67 ± 8% correct language identification on average for the 7 languages with utterances of 21 seconds. These results are discussed and compared with those obtained with a standard acoustic Gaussian mixture modelling approach (88 ± 5% correct identification for the 7-language identification task).
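The parameter extraction step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the speech has already been segmented into consonantal ("C") and vowel ("V") segments with known durations, and it assumes a rhythmic unit is a run of consonantal segments closed by a single vowel. The names `RhythmicUnit` and `extract_units` are hypothetical, and the Gaussian mixture modelling stage is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RhythmicUnit:
    consonant_duration: float  # total duration of the consonantal cluster (s)
    vowel_duration: float      # duration of the closing vowel (s)
    cluster_complexity: int    # number of consonantal segments in the cluster


def extract_units(segments: List[Tuple[str, float]]) -> List[RhythmicUnit]:
    """Group labelled (label, duration) segments into rhythmic units.

    Assumption: each unit is zero or more "C" segments followed by one "V".
    Trailing consonants with no following vowel are dropped in this sketch.
    """
    units: List[RhythmicUnit] = []
    cluster_duration = 0.0
    cluster_size = 0
    for label, duration in segments:
        if label == "C":
            # Accumulate the consonantal cluster preceding the next vowel.
            cluster_duration += duration
            cluster_size += 1
        elif label == "V":
            # A vowel closes the current unit.
            units.append(RhythmicUnit(cluster_duration, duration, cluster_size))
            cluster_duration = 0.0
            cluster_size = 0
        else:
            raise ValueError(f"unexpected segment label: {label!r}")
    return units


# Hypothetical segmentation of a short stretch of speech (durations in seconds).
example_segments = [
    ("C", 0.05), ("C", 0.07), ("V", 0.10),
    ("V", 0.08),
    ("C", 0.06), ("V", 0.12),
    ("C", 0.04),
]
example_units = extract_units(example_segments)
```

Each unit yields a three-dimensional vector (consonantal duration, vowel duration, cluster complexity). Under the approach described in the abstract, vectors of this kind would then be modelled with one Gaussian mixture per language or rhythm class.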
