Summarization and Classification of Wearable Camera Streams by Learning the Distributions over Deep Features of Out-of-Sample Image Sequences

A popular approach to training classifiers for new image classes is to reuse the lower levels of a pre-trained feed-forward neural network and retrain only the top. Most layers thus serve simply as highly nonlinear feature extractors. While these features have been found useful for classifying a variety of scenes and objects, previous work has also demonstrated unusual sensitivity to the input, especially for images that stray far from the training distribution. This can lead to surprising results: an imperceptible change in an image can be enough to completely change the predicted class. The problem is particularly acute in applications involving personal data, typically acquired with wearable cameras (e.g., visual lifelogs), where it is compounded by the dearth of new labeled training data, which makes supervised learning with deep models difficult. To alleviate these problems, in this paper we propose a new generative model that captures the feature distribution in new data. Its latent space then becomes more representative of the new data, while still retaining the generalization properties of the pre-trained features. In particular, we use constrained Markov walks over a counting grid for modeling image sequences, which not only yield good latent representations but also allow for excellent classification with only a handful of labeled training examples of the new scenes or objects, a scenario typical in lifelogging applications.
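To make the idea of a constrained Markov walk over a counting grid more concrete, the following is a minimal sketch and not the model as implemented in the paper. It assumes that each frame is summarized by a non-negative feature vector treated as a bag of counts, that the grid is toroidal, that a frame placed at a grid location is generated from the average of the cell distributions inside a window anchored there, and that consecutive frames may only move to nearby locations. All function names, the uniform neighborhood transition, and the parameters `window` and `radius` are illustrative assumptions.

```python
import numpy as np


def window_distributions(pi, window):
    """Average the cell distributions pi (E1, E2, Z) over a wrap-around
    window anchored at every grid cell."""
    w1, w2 = window
    h = np.zeros_like(pi)
    for d1 in range(w1):
        for d2 in range(w2):
            # A negative shift brings cell (i + d1, j + d2) to position (i, j).
            h += np.roll(pi, shift=(-d1, -d2), axis=(0, 1))
    return h / (w1 * w2)


def neighbour_transitions(shape, radius=1):
    """Row-stochastic transition matrix: uniform moves to cells within
    `radius` steps on the toroidal grid (an assumed form of the constraint)."""
    e1, e2 = shape
    n = e1 * e2
    trans = np.zeros((n, n))
    for i in range(e1):
        for j in range(e2):
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    target = ((i + di) % e1) * e2 + (j + dj) % e2
                    trans[i * e2 + j, target] = 1.0
    return trans / trans.sum(axis=1, keepdims=True)


def sequence_log_likelihood(counts, pi, window, radius=1):
    """Scaled forward algorithm over grid locations.

    counts: (T, Z) non-negative feature counts, one row per frame.
    pi:     (E1, E2, Z) strictly positive distributions, one per grid cell.
    Returns log p(sequence) up to the multinomial coefficients.
    """
    e1, e2, z = pi.shape
    n = e1 * e2
    h = window_distributions(pi, window).reshape(n, z)
    log_emit = counts @ np.log(h).T              # (T, n) per-location scores
    trans = neighbour_transitions((e1, e2), radius)
    alpha = np.full(n, 1.0 / n)                  # uniform prior over locations
    total = 0.0
    for t in range(counts.shape[0]):
        if t > 0:
            alpha = alpha @ trans                # constrained move on the grid
        shift = log_emit[t].max()                # guard against underflow
        alpha = alpha * np.exp(log_emit[t] - shift)
        norm = alpha.sum()
        total += shift + np.log(norm)
        alpha /= norm                            # posterior over locations
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.dirichlet(np.ones(8), size=(6, 6))            # toy 6x6 grid, Z = 8
    frames = rng.integers(0, 4, size=(10, 8)).astype(float)  # toy count vectors
    print(sequence_log_likelihood(frames, grid, window=(2, 2), radius=1))
```

In a few-shot setting, one plausible use of such a score is to compare the sequence likelihood, or the inferred grid locations, of an unlabeled sequence against those of the handful of labeled ones; the sketch leaves learning of `pi` out entirely.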
