Evaluating Speech, Face, Emotion and Body Movement Time-series Features for Automated Multimodal Presentation Scoring

We analyze how fusing features obtained from different multimodal data streams, such as speech, face, body movement and emotion tracks, can be applied to the scoring of multimodal presentations. We compute both time-aggregated and time-series features from these data streams: the former are statistical functionals and other cumulative features computed over the entire time series, while the latter, dubbed histograms of cooccurrences, capture how different prototypical body postures or facial configurations co-occur within different time lags of each other over the evolution of the multimodal, multivariate time series. We examine the relative utility of these features, along with curated speech-stream features, in predicting human-rated scores for multiple aspects of presentation proficiency. We find that different modalities are useful in predicting different aspects, even outperforming a naive human inter-rater agreement baseline for a subset of the aspects analyzed.
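As a rough illustration of the histogram-of-cooccurrences idea described above, the sketch below counts how often pairs of prototype labels (e.g., cluster indices standing in for prototypical body postures or facial configurations) co-occur at a chosen set of time lags and concatenates the normalized counts into a feature vector. The function name, the default lags, and the use of k-means-style cluster indices are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cooccurrence_histogram(labels, n_prototypes, lags=(1, 2, 5, 10)):
    """Histogram-of-cooccurrences sketch: for each lag tau, count how often
    prototype i at time t co-occurs with prototype j at time t + tau.
    Returns one flattened n_prototypes x n_prototypes histogram per lag,
    concatenated into a single feature vector. (Illustrative only.)"""
    labels = np.asarray(labels)
    features = []
    for tau in lags:
        hist = np.zeros((n_prototypes, n_prototypes))
        for t in range(len(labels) - tau):
            hist[labels[t], labels[t + tau]] += 1
        # Normalize so recordings of different lengths remain comparable.
        if hist.sum() > 0:
            hist /= hist.sum()
        features.append(hist.ravel())
    return np.concatenate(features)

# Toy example: frame-level prototype assignments for one presentation clip.
frame_labels = [0, 0, 1, 2, 2, 2, 1, 0, 3, 3, 1, 2]
feature_vector = cooccurrence_histogram(frame_labels, n_prototypes=4)
print(feature_vector.shape)  # (4 * 4 * number_of_lags,) = (64,)
```

In this sketch, the resulting vector could then be fed to a standard regressor or SVM alongside the time-aggregated statistical functionals.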
