Audio-Visual Evaluation of Oratory Skills

What makes a talk successful? Is it the content or the presentation? We estimate the contribution of a speaker's oratory skills to a talk's success while ignoring the content of the talk. By oratory skills we refer to facial expressions, motions, and gestures, as well as vocal features. We use TED Talks as our dataset and measure the success of each talk by its view count. On this dataset we train a neural network to assess the oratory skills in a talk through three factors: body pose, facial expressions, and acoustic features. Most previous work on the automatic evaluation of oratory skills relies on hand-crafted expert annotations, both for rating the quality of a talk and for identifying predefined actions. In contrast, we take the view count reported by TED as the measure of a talk's quality, and let the network automatically learn which actions, expressions, and sounds are relevant to a talk's success. We find that oratory skills alone contribute substantially to a talk's chances of being successful.
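As an illustration of the three-factor setup described above, here is a minimal sketch of a late-fusion network that scores a talk from per-frame pose, face, and audio feature sequences. The class name `OratorySkillNet`, the GRU encoders, and all feature dimensions (17 3D joints flattened to 51, a 512-D face embedding, a 192-D speaker embedding) are illustrative assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn


class OratorySkillNet(nn.Module):
    """Hypothetical three-branch late-fusion model: one sequence encoder
    per modality (pose, face, audio), fused into a single success score."""

    def __init__(self, pose_dim=51, face_dim=512, audio_dim=192, hidden=128):
        super().__init__()
        # One GRU per modality summarizes its feature sequence over time.
        self.pose_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.face_enc = nn.GRU(face_dim, hidden, batch_first=True)
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        # Fusion head maps the concatenated summaries to one scalar score.
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pose_seq, face_seq, audio_seq):
        # Inputs are (batch, time, feature_dim); keep each last hidden state.
        _, h_pose = self.pose_enc(pose_seq)
        _, h_face = self.face_enc(face_seq)
        _, h_audio = self.audio_enc(audio_seq)
        fused = torch.cat([h_pose[-1], h_face[-1], h_audio[-1]], dim=-1)
        return self.head(fused)  # e.g. predicted log view count


# Toy usage: one 10-second clip sampled at 25 fps (250 time steps).
model = OratorySkillNet()
score = model(torch.randn(1, 250, 51),
              torch.randn(1, 250, 512),
              torch.randn(1, 250, 192))
print(score.shape)  # torch.Size([1, 1])
```

Trained as a regressor against, for example, log view counts, such a model would let the per-modality encoders learn which motions, expressions, and vocal cues matter, in the spirit of the approach described above; the regression target and training details here are assumptions, not the authors' reported configuration.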
