An Ensemble Model Using Face and Body Tracking for Engagement Detection

Precise detection and localization of learners' engagement levels are useful for monitoring their learning quality. For the engagement detection task of the EmotiW Challenge, we propose a series of novel improvements: (a) a cluster-based framework for fast engagement-level prediction, (b) a neural network with an attention pooling mechanism, (c) heuristic rules based on body posture information, and (d) a model ensemble for more accurate and robust predictions. Experimental results suggest that the proposed methods effectively improve engagement detection performance. On the validation set, our system reduces the baseline Mean Squared Error (MSE) by about 56%; on the final test set, it yields a competitively low MSE of 0.081.
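As a rough illustration of component (b), the sketch below shows one common form of feed-forward attention pooling, which collapses a sequence of per-frame features into a single clip-level vector via softmax-weighted averaging. The function name attention_pool, its parameters, and the feature dimensions are hypothetical and are not taken from the paper; this is a minimal sketch of the general technique, not the authors' implementation.

    import numpy as np

    def attention_pool(frame_features, w, b=0.0):
        # Pool per-frame features (T, D) into one clip-level vector (D,)
        # using feed-forward attention: score each frame, softmax over time,
        # then take the weighted average of the frames.
        scores = frame_features @ w + b          # (T,) unnormalized attention scores
        weights = np.exp(scores - scores.max())  # numerically stable softmax over time
        weights /= weights.sum()
        return weights @ frame_features          # (D,) attention-weighted clip vector

    # Usage with synthetic data: 30 frames of 16-dimensional features.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(30, 16))
    w = rng.normal(size=16)
    clip_vector = attention_pool(feats, w)
    print(clip_vector.shape)                     # (16,)

In practice the attention parameters would be learned jointly with the downstream regressor, so that frames more indicative of engagement receive larger weights.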
