Towards modelling visual ambiguity for visual object detection

The widespread adoption of Web 2.0 applications has produced huge amounts of user-generated multimedia content, motivating its use as training data for visual object detectors. However, the global (image-level) nature of the annotations, the noise in the associated information, and the ambiguity that characterizes these examples disqualify them from serving directly as learning samples. Nevertheless, the tremendous volume of data currently hosted in social networks affords us the luxury of discarding a substantial number of candidate learning examples, provided we can devise a gauging mechanism that filters out ambiguous or noisy samples. Our objective in this work is to define a measure of visual ambiguity, i.e. the ambiguity caused by the visual similarity of semantically dissimilar concepts, in order to guide the selection of positive training regions from user-tagged images. This is achieved by restricting the search space to images with a higher probability of containing the desired regions, while excluding visually ambiguous objects that could confuse the selection algorithm. Experimental results show that employing visual ambiguity yields better separation between the targeted true-positive regions and the undesired negative regions.
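
To make the idea concrete, the sketch below illustrates one plausible way such a measure could drive image selection: the ambiguity between two concepts is taken to be high when their visual similarity is high but their semantic relatedness is low, and candidate images whose co-occurring tags are visually ambiguous with the target concept are discarded. The function names, concepts, similarity values, and the product formula are all illustrative assumptions, not the paper's actual formulation.

```python
# A minimal sketch of a visual-ambiguity score, assuming precomputed pairwise
# visual similarities (e.g. from low-level features) and semantic relatedness
# values (e.g. WordNet-based). All concepts and numbers are hypothetical.

# Hypothetical pairwise visual similarity between concepts, in [0, 1].
VISUAL_SIM = {
    ("sky", "sea"): 0.85,
    ("sky", "cloud"): 0.60,
    ("sea", "cloud"): 0.40,
}

# Hypothetical semantic relatedness between the same concepts, in [0, 1].
SEMANTIC_REL = {
    ("sky", "sea"): 0.20,
    ("sky", "cloud"): 0.75,
    ("sea", "cloud"): 0.30,
}


def _lookup(table, a, b):
    """Symmetric lookup into a pairwise table; unknown pairs score 0."""
    return table.get((a, b), table.get((b, a), 0.0))


def ambiguity(concept_a, concept_b):
    """High when two concepts look alike but are semantically dissimilar."""
    return _lookup(VISUAL_SIM, concept_a, concept_b) * (
        1.0 - _lookup(SEMANTIC_REL, concept_a, concept_b)
    )


def filter_images(images, target, threshold=0.5):
    """Keep images tagged with the target whose co-occurring tags are not
    visually ambiguous with it; each image is (image_id, set_of_tags)."""
    kept = []
    for image_id, tags in images:
        if target not in tags:
            continue
        others = tags - {target}
        if all(ambiguity(target, tag) < threshold for tag in others):
            kept.append(image_id)
    return kept


if __name__ == "__main__":
    images = [
        ("img1", {"sky", "cloud"}),  # kept: cloud is semantically close to sky
        ("img2", {"sky", "sea"}),    # dropped: sea looks like sky but differs semantically
    ]
    print(filter_images(images, "sky"))  # -> ['img1']
```

In this toy example the score for ("sky", "sea") is 0.85 × (1 − 0.20) = 0.68, so the image is rejected, while ("sky", "cloud") scores 0.60 × (1 − 0.75) = 0.15 and is retained, mirroring the intended behaviour of keeping only images unlikely to contain confusing look-alike objects.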
