CLEESE: An open-source audio-transformation toolbox for data-driven experiments in speech and music cognition

Over the past few years, the field of visual social cognition and face processing has been dramatically impacted by a series of data-driven studies employing computer-graphics tools to synthesize arbitrary, meaningful facial expressions. In the auditory modality, reverse correlation has traditionally been used to characterize sensory processing at the level of spectral or spectro-temporal stimulus properties, but not the higher-level cognitive processing of, e.g., words, sentences, or music, for lack of tools able to manipulate the stimulus dimensions relevant to these processes. Here we present CLEESE, an open-source audio-transformation toolbox able to systematically randomize the prosody/melody of existing speech and music recordings. CLEESE works by cutting a recording into short successive time segments (e.g., every 100 milliseconds of a spoken utterance) and applying a random parametric transformation to each segment's pitch, duration, or amplitude, using a new Python-language implementation of the phase-vocoder digital audio technique. We present two applications of the tool: generating stimuli for studying the intonation processing of interrogative vs. declarative speech, and the rhythm processing of sung melodies.
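The segment-wise randomization described above can be pictured as drawing one random transformation value per time window and assembling them into a breakpoint function that a phase-vocoder engine then applies to the recording. The following is a minimal illustrative sketch of that idea, not the actual CLEESE API: the function name, the Gaussian distribution of pitch shifts (in cents), and the 100-ms segment length are assumptions chosen to mirror the description in the text.

```python
import numpy as np

def random_pitch_bpf(duration_s, segment_s=0.1, sd_cents=100.0, seed=None):
    """Illustrative sketch (not the CLEESE API): draw one random pitch
    shift, in cents, for each successive time segment of a recording,
    returned as a (time, shift) breakpoint function that a phase-vocoder
    engine could apply."""
    rng = np.random.default_rng(seed)
    n_segments = int(round(duration_s / segment_s))
    times = segment_s * np.arange(n_segments)          # segment onsets (s)
    shifts = rng.normal(0.0, sd_cents, size=n_segments)  # cents, N(0, sd)
    return np.column_stack([times, shifts])

# One random pitch contour for a 0.5-s utterance, one value per 100 ms:
bpf = random_pitch_bpf(duration_s=0.5, segment_s=0.1, sd_cents=100.0, seed=0)
```

In a reverse-correlation experiment, many such independently drawn contours would be applied to the same base recording, and listeners' responses regressed against the random shifts to recover the mental representation of, say, interrogative intonation.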
