Semantic retrieval of personal photos using a deep autoencoder fusing visual features with speech annotations represented as word/paragraph vectors

Retrieving photos from a huge personal collection with high-level personal queries (e.g. “uncle Bill’s house”) is very attractive for users, but technically very challenging. Previous work proposed a set of approaches toward this goal under the assumption that only 30% of the photos are annotated with sparse spoken descriptions recorded when the photos are taken. In this paper, to promote the interaction between different types of features, we use continuous-space word representations to train a paragraph vector model for the speech annotations, and then fuse the paragraph vectors with the visual features produced by a deep Convolutional Neural Network (CNN) using a Deep AutoEncoder (DAE). The retrieval framework therefore combines the word vectors and paragraph vectors of the speech annotations, the CNN-based visual features, and the DAE-based fused visual/speech features in a three-stage process that includes a two-layer random walk. Preliminary experiments showed significantly improved retrieval performance.
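The fusion step can be illustrated with a minimal sketch, assuming the visual feature and the paragraph vector of each photo are simply concatenated and passed through an autoencoder whose bottleneck serves as the fused feature. This is not the authors' implementation: the abstract does not give the network depth, layer sizes, activation, or training procedure, so the single hidden layer, the `tanh` activation, plain gradient descent, and every name below (`train_fusion_autoencoder`, `code_dim`, `lr`) are assumptions made for illustration. Handling of photos without speech annotations is also left out.

```python
import numpy as np


def train_fusion_autoencoder(visual, speech, code_dim=8, epochs=200, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder on concatenated features.

    visual: (n_photos, d_v) array standing in for CNN-based visual features.
    speech: (n_photos, d_s) array standing in for paragraph vectors.
    Returns an encoder function and the per-epoch reconstruction losses.
    """
    rng = np.random.default_rng(seed)
    x = np.hstack([visual, speech])              # (n_photos, d_v + d_s)
    n, d = x.shape

    # Small random weights (hypothetical initialisation).
    w_enc = rng.normal(0.0, 0.1, (d, code_dim))
    b_enc = np.zeros(code_dim)
    w_dec = rng.normal(0.0, 0.1, (code_dim, d))
    b_dec = np.zeros(d)

    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ w_enc + b_enc)           # fused code (bottleneck)
        x_hat = h @ w_dec + b_dec                # linear reconstruction
        err = x_hat - x
        losses.append(float(np.mean(err ** 2)))  # mean squared error

        # Backpropagate the mean squared error through both layers.
        g_out = 2.0 * err / (n * d)
        g_w_dec = h.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_h = (g_out @ w_dec.T) * (1.0 - h ** 2)  # derivative of tanh
        g_w_enc = x.T @ g_h
        g_b_enc = g_h.sum(axis=0)

        # Plain gradient descent update.
        w_enc -= lr * g_w_enc
        b_enc -= lr * g_b_enc
        w_dec -= lr * g_w_dec
        b_dec -= lr * g_b_dec

    def encode(v, s):
        """Map visual and speech features to the fused representation."""
        return np.tanh(np.hstack([v, s]) @ w_enc + b_enc)

    return encode, losses


if __name__ == "__main__":
    # Random stand-in data; real inputs would be CNN features and paragraph vectors.
    rng = np.random.default_rng(1)
    visual = rng.normal(size=(20, 16))
    speech = rng.normal(size=(20, 8))
    encode, losses = train_fusion_autoencoder(visual, speech, code_dim=4)
    print(encode(visual, speech).shape, losses[0] > losses[-1])
```

The fused codes returned by `encode` could then be compared between photos (for example by cosine similarity) to build a similarity graph, which is one plausible way such features might feed a later ranking stage.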
