Paper information - 100,000 Podcasts: A Spoken English Document Corpus

Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition, they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. The corpus is orders of magnitude larger than previous speech corpora used for search and summarization. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.
