Automatic Image Annotation by Ensemble of Visual Descriptors

Most automatic image annotation systems in the literature concatenate color, texture, and/or shape features into a single feature vector and learn a set of high-level semantic categories with a single learning machine. This approach is rather naive for mapping visual features to high-level semantic information about the categories. Concatenating many features with different visual properties and wide dynamic ranges may lead to the curse of dimensionality and to redundancy. In addition, it usually requires normalization, which may cause undesirable distortion in the feature space. An elegant way of reducing the effects of these problems is to design a dedicated feature space for each image category, depending on its content, and to learn a range of visual properties of the whole image from a variety of feature sets. For this purpose, a two-layer ensemble learning system, called Supervised Annotation by Descriptor Ensemble (SADE), is proposed. SADE first extracts a variety of low-level visual descriptors from the image. Each descriptor is then fed to a separate learning machine in the first layer. Finally, a meta-layer classifier is trained on the outputs of the first-layer classifiers, and images are annotated using the decisions of the meta-layer classifier. This approach not only avoids normalization but also reduces the effects of the curse of dimensionality and of redundancy. The proposed system outperforms a state-of-the-art automatic image annotation system in an equivalent experimental setup.
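For illustration, below is a minimal sketch of the two-layer scheme the abstract describes, in the spirit of stacked generalization. It assumes scikit-learn-style estimators; the classifier choices (SVM base learners, a logistic-regression meta-layer) and the use of out-of-fold probability outputs are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a two-layer (stacked) descriptor ensemble.
# Classifier choices and data shapes are illustrative assumptions,
# not the paper's actual configuration.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def train_sade(descriptor_sets, labels):
    """descriptor_sets: dict mapping descriptor name -> (n_samples, d_i) array.
    Each descriptor gets its own first-layer classifier; the meta-layer
    is trained on their class-probability outputs, so the heterogeneous
    feature spaces are never concatenated or jointly normalized."""
    base_models, meta_features = {}, []
    for name, X in descriptor_sets.items():
        clf = SVC(probability=True)  # one learning machine per descriptor
        # Out-of-fold predictions keep the meta-layer from training on
        # outputs the base classifier has already memorized.
        meta_features.append(
            cross_val_predict(clf, X, labels, cv=5, method="predict_proba"))
        base_models[name] = clf.fit(X, labels)  # refit on all data
    meta = LogisticRegression(max_iter=1000).fit(
        np.hstack(meta_features), labels)
    return base_models, meta

def annotate(base_models, meta, descriptor_sets):
    """Annotate new images using the meta-layer's decision."""
    # Iterate in the same descriptor order used at training time.
    probs = [base_models[name].predict_proba(descriptor_sets[name])
             for name in base_models]
    return meta.predict(np.hstack(probs))
```

Because the meta-layer consumes class-probability outputs, which already lie on a common [0, 1] scale, this sketch reflects the abstract's claim that the scheme sidesteps cross-descriptor normalization.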
