ICT Tools for Searching, Annotation and Analysis of Audiovisual Media

1. This report concerns the use of ICT tools in research in the arts and humanities using speech, music, video and film in digital form, hereafter referred to as AV (audio-visual material).
2. The quantity of AV available to researchers is now massive and rapidly expanding, far exceeding the quantity of available print material in sheer number of bytes.
3. The main problem for researchers is no longer a paucity of AV but how to locate the material of interest in the vast quantity available, and how to organise material once collected.
4. Metadata and tagging continue to be important to facilitate search. Standards for metadata for AV do exist but are not yet widely adopted.
5. Content-based search is becoming possible for speech, but is still beyond the horizon for music, and even more distant for video and film. Mixed speech, music and noise is very hard to search.
6. Copyright protection hampers research with AV, and digital rights management systems (DRM) threaten to prevent research altogether.
7. Once AV has been located and accessed, much research proceeds by annotation, for which many tools exist. Systems for reuse and sharing of annotations are in their infancy, however.
8. Many researchers make some kind of transcription of AV, and would value tools to automate this process. For speech, such tools exist with important limits to their accuracy and applicability.
9. Full music transcription tools do not exist, but researchers can benefit from other sorts of visualisations, for which tools do exist.
10. Researchers could work more effectively with better knowledge of ICT. A common failing is not so much ignorance of how to use particular tools as a misunderstanding of the processes the computer carries out and the validity of its results.
11. In Section 1.3, recommendations are made concerning:
    i. provision of ICT infrastructure for arts and humanities research,
    ii. training for researchers,
    iii. copyright law and digital rights management (DRM),
    iv. resource development unlikely to receive commercial support,
    v. dissemination of expertise and examples in research on AV with ICT,
    vi. standards and commercial tools,
    vii. metadata and digitisation projects outside the research community,
    viii. management of researchers' private collections of AV,
    ix. deposit and sharing of AV, including annotations of AV.

and many others for informal conversations. We also gratefully acknowledge the generous amount of time and information given by all of the participants …
