Multimodal photo annotation and retrieval on a mobile phone

Mobile phones are becoming multimedia devices: users routinely capture photos and videos on them. As the amount of digital multimedia content grows, it becomes increasingly difficult to find specific images on the device. In this paper, we present a multimodal mobile image retrieval prototype named MAMI (Multimodal Automatic Mobile Indexing). It allows users to annotate, index and search for digital photos on their phones via speech or image input. Speech annotations can be added at the time of capture or later. Additional metadata such as location, user identification, and date and time of capture is stored on the phone automatically. A key advantage of MAMI is that it is implemented as a stand-alone application that runs in real time on the phone; users can therefore search their personal photo archives without requiring connectivity to a server. We compare multimodal and monomodal approaches to image retrieval and propose a novel algorithm named the Multimodal Redundancy Reduction (MR2) Algorithm. In addition to describing the proposed approaches in detail, we present experimental results comparing the retrieval accuracy of monomodal versus multimodal algorithms.
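The abstract does not specify how the MR2 algorithm combines modalities, but a common baseline for multimodal retrieval of this kind is late fusion: score each photo separately against the speech query and the image query, normalize, and combine. The sketch below is a hypothetical weighted sum-rule fusion, not the paper's actual method; all function names and the choice of min-max normalization are assumptions for illustration.

```python
def normalize(scores):
    """Min-max normalize a list of similarity scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(speech_scores, image_scores, w_speech=0.5):
    """Combine per-photo speech and image similarities (sum rule)."""
    s = normalize(speech_scores)
    v = normalize(image_scores)
    return [w_speech * a + (1 - w_speech) * b for a, b in zip(s, v)]

def rank(photo_ids, speech_scores, image_scores):
    """Return photo ids ordered by fused similarity, best first."""
    fused = fuse(speech_scores, image_scores)
    return [pid for _, pid in sorted(zip(fused, photo_ids), reverse=True)]
```

A monomodal baseline corresponds to setting `w_speech` to 0 or 1, which is the kind of comparison the experiments in the paper evaluate.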
