Text versus speech: a comparison of tagging input modalities for camera phones

Speech and typed text are two common input modalities for mobile phones. However, little research has compared their ability to support annotation and retrieval of digital pictures on mobile devices. In this paper, we report the results of a month-long field study in which participants took pictures with their camera phones and could annotate them using speech, typed text, or both. The same participants then took part in a controlled experiment in which they retrieved images from annotations and annotations from images, allowing us to study how effectively each modality supports recall of previously captured pictures. The results show that each modality has advantages and shortcomings for both tag production and picture retrieval. We conclude with several guidelines for the design of tagging applications on portable devices.
