An Active Data Representation of Videos for Automatic Scoring of Oral Presentation Delivery Skills and Feedback Generation

Public speaking is an important skill whose acquisition requires dedicated and time-consuming training. In recent years, researchers have begun to investigate automatic methods to support public speaking skills training. These methods include assessing the trainee's oral presentation delivery skills, which may be accomplished through automatic understanding and processing of the social and behavioral cues displayed by the presenter. In this study, we propose an automatic scoring system for presentation delivery skills that uses a novel active data representation method to automatically rate segments of a full video presentation. Most existing approaches employ a two-step strategy in which multiple events are first detected and then classified; this requires annotated data to build each event detector and a data representation derived from the detectors' outputs for classification. Our method, by contrast, requires no event detectors: the proposed data representation is generated in an unsupervised manner from low-level audiovisual descriptors using self-organizing maps, and is then used for video classification. The same representation is also used to analyse video segments within a full video presentation in terms of several characteristics of the presenter's performance. The audio representation provides the best prediction results for self-confidence and enthusiasm, posture and body language, structure and connection of ideas, and overall presentation delivery. The video representation provides the best results for presenting relevant information with good pronunciation, using language appropriate to the audience, and maintaining an adequate voice volume for the audience. The fusion of audio and video data provides the best results for eye contact. Applications of the method to the provision of feedback to teachers and trainees are discussed.
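As a rough illustrative sketch (not the authors' implementation, whose details are given in the paper itself), the unsupervised representation described above can be approximated as follows: frame-level audiovisual descriptors are used to train a small self-organizing map, and each video segment is then represented by the normalized histogram of its best-matching-unit activations, which a standard classifier can consume. All function names, the grid size, and the training schedule below are illustrative assumptions.

```python
import numpy as np

def train_som(features, grid=(4, 4), epochs=20, lr0=0.5, seed=0):
    """Fit a small self-organizing map to frame-level descriptors.

    features: array of shape (n_frames, n_dims), e.g. low-level
    audio or video descriptors extracted per frame.
    """
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    weights = rng.normal(size=(n_units, features.shape[1]))
    # 2-D grid coordinates of the units, used for neighbourhood distances
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs)            # decaying learning rate
        sigma = 0.5 + max(grid) / 2 * (1.0 - epoch / epochs)
        for x in features[rng.permutation(len(features))]:
            bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # Gaussian neighbourhood centred on the best-matching unit
            d = np.linalg.norm(coords - coords[bmu], axis=1)
            h = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

def active_representation(segment_features, weights):
    """Normalized histogram of best-matching-unit hits for one segment."""
    dists = np.linalg.norm(
        segment_features[:, None, :] - weights[None, :, :], axis=2)
    bmus = np.argmin(dists, axis=1)
    hist = np.bincount(bmus, minlength=len(weights)).astype(float)
    return hist / hist.sum()
```

A downstream classifier (e.g. an SVM) would then be trained on these fixed-length histograms, one per video segment, to predict the delivery-skill ratings.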
