Semantic Indexing of Wearable Camera Images: Kids'Cam Concepts

Content-based search over image media, whether still images or video, typically relies on manually or automatically assigned concepts or tags, or in some use cases on image-to-image similarity. While automatic concept detection using machine learning has advanced greatly in recent years, there remains a mismatch between the semantics of the concepts we can detect automatically and the semantics of the words used in a user's query, for example. In this paper we report on a large collection of images from wearable cameras, gathered as part of the Kids'Cam project, which has been both manually annotated with a vocabulary of 83 concepts and automatically annotated with a vocabulary of 1,000 concepts. This collection allows us to explore how language, in the form of two distinct concept vocabularies or spaces, one manually assigned and thus forming a ground truth, is used to represent images, in our case taken with wearable cameras. It also allows us to discuss, in general terms, mismatches of concepts in visual media that derive from mismatches in language. We report the data processing we have completed on this collection and some of our initial experiments in mapping between the two concept vocabularies.
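One common way to map between two concept vocabularies is to embed the concept labels in a shared semantic vector space (for example, with distributional word embeddings) and link each concept in one vocabulary to its nearest neighbour in the other by cosine similarity. The sketch below illustrates this idea only; the embedding vectors, concept names, and vocabulary sizes are toy, hypothetical stand-ins, not the Kids'Cam vocabularies or any trained embedding model.

```python
import numpy as np

# Toy, hand-made 3-d "embeddings" standing in for real learned word
# vectors (e.g. word2vec); purely illustrative, not from the paper's data.
EMBEDDINGS = {
    # concepts from a small, manually assigned vocabulary
    "dog":     np.array([0.90, 0.10, 0.00]),
    "vehicle": np.array([0.00, 0.80, 0.30]),
    # concepts from a larger, automatically assigned vocabulary
    "puppy":   np.array([0.85, 0.15, 0.05]),
    "car":     np.array([0.05, 0.75, 0.35]),
    "tree":    np.array([0.10, 0.10, 0.90]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def map_concept(concept, target_vocab):
    """Map one concept onto its most similar concept in target_vocab."""
    scores = {t: cosine(EMBEDDINGS[concept], EMBEDDINGS[t])
              for t in target_vocab}
    return max(scores, key=scores.get)

manual_vocab = ["dog", "vehicle"]
auto_vocab = ["puppy", "car", "tree"]

# Nearest-neighbour mapping from the manual to the automatic vocabulary
mapping = {c: map_concept(c, auto_vocab) for c in manual_vocab}
```

With real embeddings the same nearest-neighbour scheme applies unchanged, though a similarity threshold is usually needed to detect concepts that have no good counterpart in the other vocabulary.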
