Who are they looking at? Automatic Eye Gaze Following for Classroom Observation Video Analysis

We develop an end-to-end neural network-based computer vision system that automatically identifies where each person in a 2-D image of a school classroom is looking (“gaze following”), as well as whom she/he is looking at. Automatic gaze following could facilitate data mining of the large collections of classroom observation videos that are routinely recorded in schools around the world, in order to understand social interactions between teachers and students. Our network is based on the architecture of [15] but is extended to predict whether each person is looking at a target inside or outside the image, and to predict not only where, but also whom, the person is looking at. Moreover, since our focus is on classroom observation videos, we built a new dataset of publicly available classroom sessions drawn from 70 YouTube videos, with labels from 408 labelers who annotated a total of 17,758 gazes in 2,263 unique image frames. Results of our experiments indicate that the proposed neural network can estimate the gaze target – either a spatial location or the face of a person – with substantially higher accuracy than several baselines.
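To make the two architectural extensions concrete, here is a minimal PyTorch sketch of a two-pathway gaze-following model in the spirit of [15]: a scene pathway encodes the full frame, a head pathway encodes a crop of the gazer's head, and the fused features drive two output heads, one regressing a spatial heatmap over gaze targets and one classifying whether the target lies inside or outside the frame. Everything below (the GazeFollowNet class, layer sizes, and head names) is an illustrative assumption, not the authors' released implementation.

import torch
import torch.nn as nn

class GazeFollowNet(nn.Module):
    """Two-pathway gaze-following sketch.

    A scene pathway encodes the full frame and a head pathway encodes a
    crop of the gazer's head; their fused features feed (a) a spatial
    heatmap over candidate gaze locations and (b) an in-frame vs.
    out-of-frame classifier. Layer sizes are illustrative only.
    """

    def __init__(self, feat_dim=256, heatmap_size=64):
        super().__init__()

        # Lightweight stand-in encoders; a real system would use a
        # pretrained CNN backbone for both pathways.
        def encoder():
            return nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )

        self.scene_encoder = encoder()
        self.head_encoder = encoder()
        fused = 2 * feat_dim
        self.heatmap_size = heatmap_size
        # Head (a): dense heatmap over possible gaze target locations.
        self.heatmap_head = nn.Linear(fused, heatmap_size * heatmap_size)
        # Head (b): single logit for "gaze target is inside the frame".
        self.in_frame_head = nn.Linear(fused, 1)

    def forward(self, scene_img, head_crop):
        # Fuse the two pathways by simple concatenation.
        z = torch.cat([self.scene_encoder(scene_img),
                       self.head_encoder(head_crop)], dim=1)
        heatmap = self.heatmap_head(z).view(-1, self.heatmap_size,
                                            self.heatmap_size)
        in_frame_logit = self.in_frame_head(z).squeeze(1)
        return heatmap, in_frame_logit

# Example: a batch of 2 RGB frames (224x224) with matching head crops.
model = GazeFollowNet()
heatmap, in_frame_logit = model(torch.randn(2, 3, 224, 224),
                                torch.randn(2, 3, 64, 64))

Under this sketch, the "whom" prediction can be realized as a post-processing step: run a face detector on the frame (e.g., as in [22]) and assign the gaze to the detected face whose bounding box captures the most heatmap mass, falling back to "no person" when the in-frame classifier predicts an out-of-frame target. This matching rule is likewise one plausible realization, not the paper's stated method.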

[1] Robert C. Pianta, et al. Classroom Assessment Scoring System™: Manual K-3, 2008.

[2] S. Kontos, et al. Teachers' Interactions with Children: Why Are They So Important? Research in Review, 1997.

[3] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[4] Andrew Zisserman, et al. Detecting People Looking at Each Other in Videos, 2014, International Journal of Computer Vision.

[5] Daniel F. McCaffrey, et al. Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment. Research Paper. MET Project, 2013.

[6] Bolei Zhou, et al. Object Detectors Emerge in Deep Scene CNNs, 2014, ICLR.

[7] Daniel F. Parks, et al. Complementary effects of gaze direction and early saliency in guiding fixations during free viewing, 2014, Journal of Vision.

[8] Peter A. Beling, et al. Classroom Video Assessment and Retrieval via Multiple Instance Learning, 2011, AIED.

[9] Ryan Shaun Joazeiro de Baker, et al. Automatic Detection of Learning-Centered Affective States in the Wild, 2015, IUI.

[10] Ashish Kapoor, et al. Automatic prediction of frustration, 2007, Int. J. Hum. Comput. Stud.

[11] Jacob Whitehill, et al. Harnessing Label Uncertainty to Improve Modeling: An Application to Student Engagement Recognition, 2018, 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).

[12] Tomás Lozano-Pérez, et al. A Framework for Multiple-Instance Learning, 1997, NIPS.

[13] Jonathan Baxter, et al. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling, 1997, Machine Learning.

[14] Andrew J. Mashburn, et al. Measures of classroom quality in prekindergarten and children's development of academic, language, and social skills, 2008, Child Development.

[15] Antonio Torralba, et al. Where are they looking?, 2015, NIPS.

[16] Antonio Torralba, et al. Following Gaze in Video, 2017, IEEE International Conference on Computer Vision (ICCV).

[17] Andrew Olney, et al. Multimodal Capture of Teacher-Student Interactions for Automated Dialogic Analysis in Live Classrooms, 2015, ICMI.

[18] James M. Rehg, et al. Social interactions: A first-person perspective, 2012, IEEE Conference on Computer Vision and Pattern Recognition.

[19] Wojciech Matusik, et al. Eye Tracking for Everyone, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Xingyu Pan, et al. Automatic classification of activities in classroom discourse, 2014, Comput. Educ.

[21] Tao Deng, et al. Where Does the Driver Look? Top-Down-Based Saliency Detection in a Traffic Driving Environment, 2016, IEEE Transactions on Intelligent Transportation Systems.

[22] Huaizu Jiang, et al. Face Detection with the Faster R-CNN, 2017, 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[23] Frédo Durand, et al. Learning to predict where humans look, 2009, IEEE 12th International Conference on Computer Vision.

[24] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[25] A. Meltzoff, et al. Gaze following: A mechanism for building social connections between infants and adults, 2014.

[26] Azizah Jaafar, et al. Eye Tracking in Educational Games Environment: Evaluating User Interface Design through Eye Tracking Patterns, 2011, IVIC.

[27] Rich Caruana, et al. Multitask Learning: A Knowledge-Based Source of Inductive Bias, 1993, ICML.

[28] N. Emery, et al. The eyes have it: the neuroethology, function and evolution of social gaze, 2000, Neuroscience & Biobehavioral Reviews.

[29] Kristy Elizabeth Boyer, et al. Automatically Recognizing Facial Expression: Predicting Engagement and Frustration, 2013, EDM.

[30] Neil Martin Robertson, et al. Deep Head Pose: Gaze-Direction Estimation in Multimodal Video, 2015, IEEE Transactions on Multimedia.

[31] Andrew Olney, et al. Multi-sensor modeling of teacher instructional segments in live classrooms, 2016, ICMI.