The BBC World Service Archive prototype

Most broadcasters have accumulated large audio and video archives stretching back over many decades. For example, the BBC World Service radio archive includes around 70,000 English-language programmes from over 45 years. This amounts to about three years of continuous audio and around 15 TB of data. The metadata around this archive is sparse and sometimes wrong, but the full audio content is available in digital form. We have built a system to process the existing audio and text and automatically annotate programmes within the archive with Linked Data web identifiers. The resulting interlinks are used to bootstrap search and navigation within this archive and expose it to users. Automated data will never be entirely accurate, so we built crowdsourcing mechanisms for users to correct and add data. The resulting crowdsourced data is then used to improve search and navigation within the archive, as well as to evaluate and improve our algorithms. As a result of this feedback cycle, the interlinks between our archive and the Semantic Web are continuously improving. This unique combination of Semantic Web technologies, automation and crowdsourcing has dramatically reduced the amount of time and effort required to publish this rich archive online. The BBC World Service archive prototype is available online at http://worldservice.prototyping.bbc.co.uk, last accessed March 2014.
