An egocentric perspective on active vision and visual object learning in toddlers

Toddlers quickly learn to recognize thousands of everyday objects despite the seemingly suboptimal training conditions of a visually cluttered world. One reason for this success may be that toddlers do not just passively perceive visual information, but actively explore and manipulate the objects around them. The work in this paper is based on the idea that active viewing and exploration create "clean" egocentric scenes that serve as high-quality training data for the visual system. We tested this idea by collecting first-person video of free toy play between toddler-parent pairs. We used the raw frames from these videos, weakly annotated with toy object labels, to train state-of-the-art machine learning models for object recognition (Convolutional Neural Networks, or CNNs), and ran several training simulations that varied the quantity and quality of the training data. Our results show that scenes captured by parents and toddlers have different properties, and that toddler scenes lead to models that learn more robust visual representations of the toy objects they contain.
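
To make the weakly supervised training setup concrete, the sketch below fine-tunes an ImageNet-pretrained CNN on raw frames, where every frame simply inherits the label of the toy in view (no bounding boxes or segmentation). This is a minimal illustration in PyTorch, not the paper's actual pipeline: the directory layout (frames/toddler/<toy_label>/...), the VGG16 backbone, and the single-epoch training loop are all assumptions made for the example.

```python
# Minimal sketch: fine-tune a pretrained CNN on weakly labeled egocentric frames.
# Assumes frames are organized as frames/toddler/<toy_label>/<frame>.jpg (hypothetical layout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("frames/toddler", transform=preprocess)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

# Start from an ImageNet-pretrained backbone (VGG16 used here as an example)
# and replace the classifier head with one output unit per toy category.
num_toys = len(train_set.classes)
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_toys)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One pass over the frames: each frame carries only the label of the toy in view.
model.train()
for frames, labels in train_loader:
    frames, labels = frames.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(frames), labels)
    loss.backward()
    optimizer.step()
```

Comparing models trained this way on toddler-captured versus parent-captured frames (e.g., by swapping the data directory) is one way to probe how the two views differ as training data.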
