CLEESE: An open-source audio-transformation toolbox for data-driven experiments in speech and music cognition

Over the past few years, the field of visual social cognition and face processing has been dramatically impacted by a series of data-driven studies employing computer-graphics tools to synthesize arbitrary, meaningful facial expressions. In the auditory modality, reverse correlation has traditionally been used to characterize sensory processing at the level of spectral or spectro-temporal stimulus properties, but not the higher-level cognitive processing of, e.g., words, sentences, or music, for lack of tools able to manipulate the stimulus dimensions relevant to these processes. Here we present CLEESE, an open-source audio-transformation toolbox able to systematically randomize the prosody/melody of existing speech and music recordings. CLEESE works by cutting a recording into short successive time segments (e.g., every 100 milliseconds of a spoken utterance) and applying a random parametric transformation to each segment's pitch, duration, or amplitude, using a new Python-language implementation of the phase-vocoder digital audio technique. We present two applications of the tool: generating stimuli for studying the intonation processing of interrogative vs. declarative speech, and the rhythm processing of sung melodies.
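The segment-wise randomization described above can be pictured as drawing one random transformation value per time window and assembling them into a breakpoint function that a phase-vocoder engine then applies to the recording. The following is a minimal illustrative sketch of that idea, not the actual CLEESE API: the function name, the Gaussian distribution of pitch shifts (in cents), and the 100-ms segment length are assumptions chosen to mirror the description in the text.

```python
import numpy as np

def random_pitch_bpf(duration_s, segment_s=0.1, sd_cents=100.0, seed=None):
    """Illustrative sketch (not the CLEESE API): draw one random pitch
    shift, in cents, for each successive time segment of a recording,
    returned as a (time, shift) breakpoint function that a phase-vocoder
    engine could apply."""
    rng = np.random.default_rng(seed)
    n_segments = int(round(duration_s / segment_s))
    times = segment_s * np.arange(n_segments)          # segment onsets (s)
    shifts = rng.normal(0.0, sd_cents, size=n_segments)  # cents, N(0, sd)
    return np.column_stack([times, shifts])

# One random pitch contour for a 0.5-s utterance, one value per 100 ms:
bpf = random_pitch_bpf(duration_s=0.5, segment_s=0.1, sd_cents=100.0, seed=0)
```

In a reverse-correlation experiment, many such independently drawn contours would be applied to the same base recording, and listeners' responses regressed against the random shifts to recover the mental representation of, say, interrogative intonation.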
