Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?

YAXIAO SONG: Multi-modal Surrogates for Retrieving and Making Sense of Videos: Is Synchronization between the Multiple Modalities Optimal? (Under the direction of Dr. Gary Marchionini)

Video surrogates can help people quickly make sense of the content of a video before downloading it or seeking more detailed information. The visual and audio features of a video are its primary information carriers and may become important components of video retrieval and video sense-making. Over the past decades, most research and development on video surrogates has focused on the visual features of the video. Comparatively little work has been done on audio surrogates and on examining their strengths and weaknesses in aiding users’ retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, which are consumed through more than one modality, for example, the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically generated multi-modal surrogates compare to manually generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and
