Zero-Shot Transfer Learning to Enhance Communication for Minimally Verbal Individuals with Autism using Naturalistic Data

Abstract

We use zero-shot transfer learning with models trained on a generic database and applied to a sparsely labeled, highly individualized audio dataset for the specialized population of people with minimally verbal autism (mvASD). Using an iterative participatory design approach, we developed a framework for collecting naturalistic data, including an open-source custom app that enables real-time data labeling. We then trained LSTM models on subclasses of generic audio embeddings from the AudioSet database and applied these models to audio recordings of a young autistic boy with no spoken words. The results demonstrate the potential of machine learning to enhance translational communication technologies and to reduce inequalities in unique and underserved populations.

1 Background and motivation

A major challenge in deep learning is generalization, particularly for small, noisy datasets from uncurated, naturalistic domains. To build machine learning-based technology for these realistic scenarios, we need strategies that leverage existing well-validated deep learning datasets. In this paper, we introduce a novel dataset of sparsely labeled naturalistic vocalizations from a minimally verbal (mv) individual with autism spectrum disorder (ASD) that includes over 13 hours of audio. To the authors' knowledge, this is the first dataset of its kind. We designed an approach to collect this dataset, including live labeling to denote affective states and communicative intent using a custom open-source mobile app. We then employed a zero-shot transfer learning approach [28] to adapt an LSTM model trained on subclasses of a large, generic audio database to classify an autistic child's¹ vocalizations for real-world use.

¹ Both "person-first" [1] and "identity-first" [6] language will be used interchangeably in this work.

There are over 800,000 people in the United States with nonverbal or minimally verbal ASD, meaning they use no words/word approximations or fewer than 20 word tokens, respectively [3, 2]. Previous applications of machine learning in ASD populations have focused on diagnosing ASD in naturalistic and laboratory settings [35, 23, 16, 8] or on detecting a single emotional valence in laboratory or outpatient care settings using physiology [13, 14, 19, 18, 21]. Most approaches have relied on labels provided by professionals, such as researchers and therapists. Prior work on non-speech vocalizations has focused on classifying typically developing infant cries by need (e.g., hunger, pain) using both humans and machines [36, 20, 26, 32, 5, 11]. Infant cries have also been used to diagnose ASD [33]. There has been extensive prior work on affect detection in speech with typical verbal content [30, 31, 10, 29, 24, 7], including in ASD populations with verbal abilities in task-driven [22, 4] and natural settings [27]. However, no known work to date has attempted to classify communicative content in vocalizations from non-infant children or adults who are minimally or non-verbal.

Figure 1: In the proposed platform, audio acquired in real time is processed through a pre-trained LSTM model and "translated" into shareable communicative content. Recent advances in audio processing with machine learning [25] make innovations like this proposed platform tenable.
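To make the pipeline in Figure 1 concrete, the following is a minimal sketch (not the authors' released code) of how a sequence of frame-level audio embeddings could be passed to an LSTM that outputs a communicative-content label. The 128-dimensional AudioSet-style embeddings, the layer sizes, and the label set shown here are illustrative assumptions, not details from the paper.

# Minimal sketch of the Figure 1 inference path: per-frame audio embeddings
# -> LSTM -> communicative-content label. Embedding dimension, hidden size,
# and labels are assumptions for illustration only.
import torch
import torch.nn as nn

class VocalizationLSTM(nn.Module):
    def __init__(self, embed_dim=128, hidden_dim=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        # x: (batch, time, embed_dim) sequence of per-frame embeddings
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])       # logits over communicative classes

# Hypothetical label set; the app's labels were caregiver-customizable.
LABELS = ["request", "protest", "laughter", "self-talk"]

model = VocalizationLSTM()
model.eval()
with torch.no_grad():
    segment = torch.randn(1, 10, 128)   # stand-in for ~10 s of 1 Hz embeddings
    pred = LABELS[model(segment).argmax(dim=-1).item()]
print(pred)

In a deployed version of the platform, a prediction of this kind could then be surfaced to caregivers in real time, as depicted in Figure 1.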
The presented work was motivated by a real, vetted need. We conducted interviews and surveys with over 75 families and individuals who had speech and language challenges, and miscommunication was reported as a major source of stress. Respondents with ASD also noted that existing communication augmentation devices were difficult to use and did not sufficiently capture affect and complex communicative intent. For these individuals, affect or intent was often conveyed through non-traditional vocal communication such as hums, consonant utterances, and babbling sounds. Some utterances may have specific known meanings, such as "buhbuh" for "go to the playground." Others may have less clear mappings, like loud, high-pitched squeals to indicate general frustration. The parents of minimally verbal autistic children reported that they understood their children's communicative intent significantly better than others who interact with their child, such as teachers and babysitters. However, a machine that could learn from or with an expert caregiver could enable visiting caregivers to understand the individual more effectively. Given the high heterogeneity of the mvASD population, it is necessary to begin with a deep case study.

2 Naturalistic dataset: collection and characterization

Data collection methods

Data were collected with a non-speaking autistic boy of elementary school age. After providing informed consent/assent, the participant wore a t-shirt with an inexpensive mini-camera in the front pocket. Single-channel audio was recorded at 32 kHz with 16 bits per sample. The participant and his family were asked to continue their regular activities (playing, running errands, etc.) during recording sessions. Because early iterations of this study indicated that retrospective labeling of multi-hour videos by a caregiver was impractical, and that the same labeling by the researcher introduced a bias we were trying to avoid, a custom Android app² was developed to enable "live labeling" of affective states and communicative indicators by the caregiver in real time. The app included button-based labels like "request," "protest," "laughter," and other customizable labels, and the caregiver was instructed to label the child's vocalizations as soon as possible after they occurred. All labels from the app were timestamped and then synced to a server at the user's discretion; a minimal sketch of how such timestamps can be aligned with the audio is given at the end of this section.

² This app is open source and available by emailing the authors.

As with any naturalistic or longitudinal dataset, the data acquired for this work pose a number of ethical and practical considerations that we have attempted to address. For example, the data collection methodology was developed over a 5-month iterative process with the pilot family to be inexpensive (easily replaced; deployable), comfortable, and unobtrusive. Paired with the open-source live-labeling app, this method enables a tenable solution for remote data collection, a critical feature for future participants in this underserved, geographically dispersed population, with minimal participant/caregiver burden. Privacy for naturalistic audio and video remains an open and active area of research. In our study, participants had the ability to review and delete video/audio segments before sharing them with the research team. In the future, the final released dataset will only include a de-identified feature set (without raw video or audio) from participants who have chosen to opt in.
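As an illustration of the alignment step mentioned above, here is a minimal sketch under assumed formats (it is not the study's released tooling): labels are read as (unix_time, label) rows from a CSV export of the app, and a fixed window of audio preceding each button press is attributed to that label. The column names, window length, and helper names are hypothetical.

# Minimal sketch of aligning live-label timestamps with the continuous recording.
# Assumptions (not from the paper): the app exports a CSV with columns
# "unix_time" and "label", and up to WINDOW_S seconds of audio before each
# button press is attributed to that label.
import csv
from dataclasses import dataclass
from typing import List

WINDOW_S = 10.0  # assumed amount of audio (seconds) kept before each label

@dataclass
class LabeledSegment:
    start_s: float  # offset into the recording, in seconds
    end_s: float
    label: str      # e.g. "request", "protest", "laughter"

def align_labels(label_csv: str, recording_start_unix: float,
                 recording_dur_s: float) -> List[LabeledSegment]:
    """Convert timestamped app labels into labeled audio segments."""
    segments = []
    with open(label_csv, newline="") as f:
        for row in csv.DictReader(f):
            t = float(row["unix_time"]) - recording_start_unix  # seconds into recording
            if 0.0 <= t <= recording_dur_s:
                segments.append(
                    LabeledSegment(max(0.0, t - WINDOW_S), t, row["label"]))
    return segments

Segments produced this way could then be embedded and classified with an LSTM of the kind sketched in Section 1.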

[1] Z. Warren, et al. Prevalence of Autism Spectrum Disorder Among Children Aged 8 Years — Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2014. Morbidity and Mortality Weekly Report: Surveillance Summaries, 2018.

[2] Azadeh Kushki, et al. A Kalman Filtering Framework for Physiological Detection of Anxiety-Related Arousal in Children With Autism Spectrum Disorder. IEEE Transactions on Biomedical Engineering, 2015.

[3] K. Michelsson, et al. The identification of some specific meanings in infant vocalization. Experientia, 1964.

[4] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[5] Philip H. S. Torr, et al. An embarrassingly simple approach to zero-shot learning. ICML, 2015.

[6] Sinéad Lydon, et al. An Examination of Heart Rate During Challenging Behavior in Autism Spectrum Disorder. 2013.

[7] V. K. Mittal, et al. Infant cry analysis of cry signal segments towards identifying the cry-cause factors. TENCON 2017 - 2017 IEEE Region 10 Conference, 2017.

[8] Björn W. Schuller, et al. Speech emotion recognition. Communications of the ACM, 2018.

[9] Guillermo Sapiro, et al. Automatic emotion and attention analysis of young children at home: a ResearchKit autism feasibility study. npj Digital Medicine, 2018.

[10] Lichuan Liu, et al. Infant cry language analysis and recognition: an experimental approach. IEEE/CAA Journal of Automatica Sinica, 2019.

[11] Björn W. Schuller, et al. openEAR — Introducing the Munich open-source emotion and affect recognition toolkit. 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, 2009.

[12] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio. SSW, 2016.

[13] Fabien Ringeval, et al. Automatic Analysis of Typical and Atypical Encoding of Spontaneous Emotion in the Voice of Children. INTERSPEECH, 2016.

[14] Björn W. Schuller, et al. Automatic Classification of Autistic Child Vocalisations: A Novel Database and Results. INTERSPEECH, 2017.

[15] Léon J. M. Rothkrantz, et al. Semantic Audiovisual Data Fusion for Automatic Emotion Recognition. 2015.

[16] B. Lester, et al. Atypical Cry Acoustics in 6-Month-Old Infants at Risk for Autism Spectrum Disorder. Autism Research, 2012.

[17] Matthew S. Goodwin, et al. Measuring Autonomic Arousal During Therapy. 2012.

[18] Roberto Rosas-Romero, et al. Newborn cry nonlinear features extraction and classification. Journal of Intelligent & Fuzzy Systems, 2018.

[19] Peter Washington, et al. Mobile detection of autism through machine learning on home video: A development and prospective validation study. PLoS Medicine, 2018.

[20] Henning Reetz, et al. Comparison of Supervised-Learning Models for Infant Cry Classification. 2015.

[21] Horia Cucu, et al. Automatic methods for infant cry classification. 2016 International Conference on Communications (COMM), 2016.

[22] Erik Linstead, et al. Applications of Supervised Machine Learning in Autism Spectrum Disorder Research: A Review. Review Journal of Autism and Developmental Disorders, 2019.

[23] Erik Marchi, et al. Emotion in the speech of children with autism spectrum conditions: prosody and everything else. WOCCI, 2012.

[24] Matthew S. Goodwin, et al. Predicting Imminent Aggression Onset in Minimally-Verbal Youth with Autism Spectrum Disorder Using Preceding Physiological Signals. PervasiveHealth, 2018.

[25] Barry M. Lester, et al. Physiologic Arousal to Social Stress in Children with Autism Spectrum Disorders: A Pilot Study. 2012.

[26] Eric Courchesne, et al. Naturalistic language sampling to characterize the language abilities of 3-year-olds with autism spectrum disorder. Autism, 2019.

[27] Björn W. Schuller, et al. AVEC 2011: The First International Audio/Visual Emotion Challenge. ACII, 2011.

[28] Kah Phooi Seng, et al. A new approach of audio emotion recognition. Expert Systems with Applications, 2014.

[29] D. K. Oller, et al. Automated vocal analysis of naturalistic recordings from children with autism, language delay, and typical development. Proceedings of the National Academy of Sciences, 2010.

[30] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[31] Günes Karabulut-Kurt, et al. Perceptual audio features for emotion detection. EURASIP Journal on Audio, Speech, and Music Processing, 2012.

[32] Aren Jansen, et al. CNN architectures for large-scale audio classification. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.