100,000 Podcasts: A Spoken English Document Corpus

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than are typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.
