Using term clouds to represent segment-level semantic content of podcasts

Spoken audio, like any time-continuous medium, is notoriously difficult to browse or skim without an interface that provides semantically annotated jump points signaling to the user where to listen in. Creating time-aligned metadata with human annotators is prohibitively expensive, which motivates the investigation of representations of segment-level semantic content based on transcripts generated by automatic speech recognition (ASR). This paper examines the feasibility of using term clouds to provide users with a structured representation of the semantic content of podcast episodes. Podcast episodes are visualized as a series of sub-episode segments, each represented by a term cloud derived from an ASR transcript. The quality of segment-level term clouds is measured quantitatively, and their utility is investigated in a small-scale user study based on human-labeled segment boundaries. Since the segment-level clouds generated from ASR transcripts prove useful, we examine an adaptation of text tiling techniques to speech, with the aim of generating segments as part of a completely automated indexing and structuring system for browsing spoken audio. Results demonstrate that the generated segments are comparable with human-selected segment boundaries.
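As a rough illustration of the two ideas named in the abstract, the sketch below builds a term cloud for each transcript segment and scores candidate segment boundaries by lexical similarity. The abstract does not state how terms are weighted or how text tiling was adapted to speech, so everything here is an assumption: the smoothed tf-idf weighting, the tiny stopword list, the block size, and the function names `term_clouds` and `gap_similarities` are all hypothetical. The boundary scoring is also a simplification of text tiling, which selects boundaries from depth scores rather than from raw similarity.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "we"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def term_clouds(segments, top_k=5):
    """Return, per segment, the top_k (term, weight) pairs by smoothed tf-idf."""
    token_lists = [tokenize(s) for s in segments]
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per segment
    clouds = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights = {
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t, c in tf.items()
        }
        # Sort by descending weight, then alphabetically for a stable order.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        clouds.append(ranked[:top_k])
    return clouds


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(utterances, block=2):
    """Similarity of the `block` utterances before and after each gap.

    Entry i describes the gap between utterance i and utterance i + 1;
    low values suggest a topic shift at that gap.
    """
    bags = [Counter(tokenize(u)) for u in utterances]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block):gap], Counter())
        right = sum(bags[gap:gap + block], Counter())
        sims.append(cosine(left, right))
    return sims


if __name__ == "__main__":
    utterances = [
        "budget vote parliament",
        "parliament budget tax",
        "football match coach",
        "coach football goal",
    ]
    sims = gap_similarities(utterances)
    boundary = sims.index(min(sims)) + 1  # index of first utterance after the gap
    segments = [" ".join(utterances[:boundary]), " ".join(utterances[boundary:])]
    for cloud in term_clouds(segments, top_k=3):
        print(cloud)
```

In an interface of the kind the abstract describes, the weights could drive the font size of each term in a segment's cloud; that mapping is left out here because the abstract does not describe it.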
