Speaker Classification Concepts: Past, Present and Future

Speaker classification requires a sufficiently accurate functional description of speaker attributes, and of the resources used in speaking, to allow new utterances to be produced that mimic a speaker's current physical, emotional and cognitive state, with the correct dialect, social-class markers and speech habits. We lack adequate functional knowledge of why and how speakers produce the utterances they do, as well as adequate theoretical frameworks embodying the kinds of knowledge, resources and intentions they use. Rhythm and intonation, intimately linked in most languages, provide a wealth of information relevant to speaker classification. Functional, as opposed to merely descriptive, models are needed. Segmental cues to speaker category, and markers for states such as fear, uncertainty, urgency and confidence, remain largely under-researched. What Ekman and Friesen did for facial expression must be done for verbal expression. The chapter examines some potentially profitable research possibilities in context.
