Using term clouds to represent segment-level semantic content of podcasts

Spoken audio, like any time-continuous medium, is notoriously difficult to browse or skim without an interface that provides semantically annotated jump points telling the user where to listen in. Creating time-aligned metadata with human annotators is prohibitively expensive, which motivates the investigation of segment-level semantic representations based on transcripts generated by automatic speech recognition (ASR). This paper examines the feasibility of using term clouds to provide users with a structured representation of the semantic content of podcast episodes. Podcast episodes are visualized as a series of sub-episode segments, each represented by a term cloud derived from an ASR transcript. The quality of segment-level term clouds is measured quantitatively, and their utility is investigated in a small-scale user study based on human-labeled segment boundaries. Since the segment-level clouds generated from ASR transcripts prove useful, we examine an adaptation of text tiling techniques to speech, so that segments can be generated as part of a completely automated indexing and structuring system for browsing spoken audio. Results demonstrate that the automatically generated segments are comparable with human-selected segment boundaries.
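The two steps named in the abstract, splitting a transcript into segments and summarizing each segment as a term cloud, can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed block size, the cosine-similarity threshold, the tf-idf style weighting, and the function names `segment` and `term_cloud` are all assumptions chosen for illustration, and the boundary rule is a much simplified stand-in for text tiling.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def segment(tokens, block=4, threshold=0.1):
    """Split a token stream wherever adjacent fixed-size blocks share
    little vocabulary. Block size and threshold are illustrative only."""
    blocks = [tokens[i:i + block] for i in range(0, len(tokens), block)]
    segments = []
    current = list(blocks[0]) if blocks else []
    for prev, nxt in zip(blocks, blocks[1:]):
        if cosine(Counter(prev), Counter(nxt)) < threshold:
            segments.append(current)  # low similarity: close the segment
            current = []
        current.extend(nxt)
    if current:
        segments.append(current)
    return segments


def term_cloud(segments, index, top_k=3):
    """Rank the terms of one segment by term frequency times the log of
    inverse segment frequency (an assumed weighting, not the paper's)."""
    n = len(segments)
    df = Counter(t for seg in segments for t in set(seg))
    tf = Counter(segments[index])
    weights = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, w) for t, w in ranked[:top_k] if w > 0]


# Toy "transcript" with an obvious topic change halfway through.
transcript = ("wine grape wine vineyard grape wine vineyard wine "
              "python code python bug code python bug code").split()
segs = segment(transcript)
clouds = [term_cloud(segs, i) for i in range(len(segs))]
```

In a browsing interface, each weight would typically be mapped to a font size, and each segment's start time would serve as the jump point. Real ASR output would also need stop-word removal and some handling of recognition errors, both of which this sketch omits.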
