TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features

The Internet offers a vast amount of multi-modal and heterogeneous information, mainly in the form of textual and visual data. Most current web-based visual object classification methods utilize only one of these data streams. As we show in this paper, combining these modalities in a proper way often yields results that cannot be attained by relying on a single stream. However, to the best of our knowledge, there is no publicly available dataset for benchmarking algorithms that use textual and visual data simultaneously. Therefore, in this work, we present an annotated multi-modal dataset, named TVGraz, which currently contains 10 visual object categories. The visual appearance of the objects in the dataset is challenging, offering a less biased benchmark. To facilitate the use of this dataset in the vision community, we additionally provide preprocessed text data obtained with the VIPS (VIsion-based Page Segmentation) method. We use a Multiple Kernel Learning (MKL) method to combine the textual and visual features and show improved classification and ranking results compared to using either data stream alone.
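To illustrate the kernel-combination idea, the sketch below computes a text kernel and a visual kernel and trains an SVM on their weighted sum. It is a minimal sketch, not the authors' implementation: the feature choices (TF-IDF vectors for text, visual-word histograms for images), the linear and chi-square kernels, and the fixed weights are illustrative assumptions, whereas the paper learns the kernel weights with MKL.

```python
# Minimal sketch: combining a text kernel and a visual kernel with fixed weights
# and training an SVM on the combined (precomputed) kernel.
# NOTE: feature types, kernel choices, and the weights `beta` are assumptions for
# illustration; the paper learns the kernel weights via Multiple Kernel Learning.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, chi2_kernel
from sklearn.svm import SVC

def combined_kernel(K_text, K_visual, beta=(0.5, 0.5)):
    """Convex combination of base kernels: K = beta_0 * K_text + beta_1 * K_visual."""
    return beta[0] * K_text + beta[1] * K_visual

# --- toy placeholders for web-page text and image bag-of-visual-words histograms ---
texts_train = ["a small dog on the grass", "a red sports car",
               "dog playing fetch", "vintage car show"]
y_train = np.array([0, 1, 0, 1])            # 0 = dog, 1 = car
hist_train = np.random.rand(4, 50)          # stand-in for visual-word histograms

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts_train)

K_text = linear_kernel(X_text)              # text kernel on TF-IDF vectors
K_vis = chi2_kernel(hist_train)             # chi-square kernel on histograms
K_train = combined_kernel(K_text, K_vis)

clf = SVC(kernel="precomputed").fit(K_train, y_train)

# At test time, kernels are computed between test and training samples.
texts_test = ["the dog ran across the lawn"]
hist_test = np.random.rand(1, 50)
K_test = combined_kernel(linear_kernel(vectorizer.transform(texts_test), X_text),
                         chi2_kernel(hist_test, hist_train))
print(clf.predict(K_test))
```

Replacing the fixed weights with weights optimized jointly with the SVM objective (as in MKL) lets the classifier decide how much each modality should contribute per task.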
