Feeding the Second Screen: Semantic Linking based on Subtitles

Television broadcasts are increasingly consumed on an interactive device or with such a device in the vicinity. Around 70% of tablet and smartphone owners use their devices while watching television [11]. This allows broadcasters to provide consumers with additional background information that they may bookmark for later consumption, in applications such as the one depicted in Figure 1. For live television, edited broadcast-specific content for second screens is hard to prepare in advance. We present an approach for automatically generating links to background information in real time, to be used on second screens. We base our semantic linking approach for television broadcasts on subtitles and Wikipedia, thereby effectively casting the task as one of identifying and generating links for elements in the stream of subtitles.

The process of automatically generating links to Wikipedia is commonly known as semantic linking and has received much attention in recent years [3, 6, 7, 9, 10]. Such links are typically explanatory, enriching the link source with definitions or background information [2, 4]. Recent work has considered semantic linking for short texts such as queries and microblogs [6–8]. The performance of generic methods for semantic linking deteriorates in such settings, as language usage is creative and context is virtually absent.

While link generation has received considerable attention in recent years, our task has unique demands: an approach needs to (i) be high-precision oriented, (ii) perform in real time, (iii) work in a streaming setting, and (iv) typically cope with very limited context. We propose a learning-to-rerank approach to improve upon a strong baseline retrieval model for generating links from streaming text. In addition, we model context using a graph-based approach.
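To make the task framing concrete, the following is a minimal sketch of generating link candidates from a stream of subtitle lines. It is not the retrieval model or the reranker proposed here. It assumes a hypothetical in-memory table of anchor-text counts (`ANCHOR_COUNTS`, with made-up numbers) and ranks each candidate by the fraction of times an anchor text links to a given Wikipedia page, a prior commonly used in the semantic linking literature. The function names and the threshold are illustrative assumptions.

```python
# Hypothetical anchor statistics: anchor text -> {Wikipedia title: link count}.
# The counts are invented for illustration; real ones would come from a Wikipedia dump.
ANCHOR_COUNTS = {
    "amsterdam": {"Amsterdam": 980, "Amsterdam (band)": 20},
    "ajax": {"AFC Ajax": 600, "Ajax (programming)": 300, "Ajax the Great": 100},
    "prime minister": {"Prime minister": 500},
}


def ngrams(tokens, max_n=3):
    """Yield all word n-grams of length 1..max_n as lowercase strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]).lower()


def candidate_links(subtitle_line, max_n=3):
    """Return (anchor, target, prior) triples for one subtitle line, best first."""
    # Very rough tokenization: strip commas and periods, split on whitespace.
    tokens = subtitle_line.replace(",", " ").replace(".", " ").split()
    candidates = []
    for gram in set(ngrams(tokens, max_n)):
        targets = ANCHOR_COUNTS.get(gram)
        if not targets:
            continue
        total = sum(targets.values())
        for title, count in targets.items():
            # Prior: share of this anchor's links that point to this title.
            candidates.append((gram, title, count / total))
    return sorted(candidates, key=lambda c: c[2], reverse=True)


def link_stream(lines, threshold=0.5):
    """Process subtitle lines one at a time, keeping only confident candidates."""
    for line in lines:
        yield line, [c for c in candidate_links(line) if c[2] >= threshold]


if __name__ == "__main__":
    for line, links in link_stream(["Ajax won in Amsterdam."]):
        print(line, links)
```

The threshold stands in for the high-precision requirement: candidates with a weak prior are dropped rather than shown. A sketch like this uses no context at all, which is the gap that reranking and context modeling are meant to address.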
The graph-based approach is particularly appropriate in our setting: it allows us to combine a number of context-based signals in streaming text and to capture the core topics relevant for a broadcast, while allowing real-time updates that reflect the progression of topics in the broadcast. Our graph-based context model is highly accurate and fast, and it allows us to disambiguate between candidate links and to capture the context as it is being built up.

Our main contribution is a set of effective feature-based methods for performing real-time semantic linking. We show how a learning-to-rerank approach performs on the task of real-time semantic linking, in terms of both effectiveness and efficiency. We extend this approach with a graph-based method to keep track of context in a textual stream and show how this can further

[1] David Ellis, et al. On the Creation of Hypertext Links in Full-Text Documents: Measurement of Inter-Linker Consistency, 1994, J. Documentation.

[2] Ian H. Witten, et al. Learning to link with Wikipedia, 2008, CIKM '08.

[3] M. de Rijke, et al. Adding semantics to microblog posts, 2012, WSDM '12.

[4] Stephen J. Green, et al. Building Hypertext Links By Computing Semantic Similarity, 1999, IEEE Trans. Knowl. Data Eng.

[5] Andrew Y. Ng, et al. Learning random walk models for inducing word dependency distributions, 2004, ICML.

[6] Silviu Cucerzan, et al. Large-Scale Named Entity Disambiguation Based on Wikipedia Data, 2007, EMNLP.

[7] M. de Rijke, et al. Mapping queries to the Linking Open Data cloud: A case study using DBpedia, 2011, J. Web Semant.

[8] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[9] Valentin Jijkoun, et al. Named entity normalization in user generated content, 2008, AND '08.

[10] Deng Cai, et al. Topic modeling with network regularization, 2008, WWW.

[11] W. Bruce Croft, et al. Linear feature-based models for information retrieval, 2007, Information Retrieval.

[12] James Allan, et al. Automatic Hypertext Construction, 1995.

[13] M. de Rijke, et al. Learning Semantic Query Suggestions, 2009, SEMWEB.

[14] James W. Cooper, et al. Towards speech as a knowledge resource, 2001, CIKM '01.

[15] Paolo Ferragina, et al. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities), 2010, CIKM.

[16] Doug Downey, et al. Local and Global Algorithms for Disambiguation to Wikipedia, 2011, ACL.

[17] M. de Rijke, et al. Linking Archives Using Document Enrichment and Term Selection, 2011, TPDL.

[18] Chris Brew, et al. Spectral Clustering for German Verbs, 2002, EMNLP.

[19] Rada Mihalcea, et al. Wikify!: linking documents to encyclopedic knowledge, 2007, CIKM '07.

[20] M. de Rijke, et al. Generating links to background knowledge: a case study using narrative radiology reports, 2011, CIKM '11.

[21] Razvan C. Bunescu, et al. Using Encyclopedic Knowledge for Named Entity Disambiguation, 2006, EACL.

[22] Martha Larson, et al. Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment, 2009, CLEF.

[23] Dragomir R. Radev, et al. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization, 2004, J. Artif. Intell. Res.

[24] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[25] Kilian Q. Weinberger, et al. Web-Search Ranking with Initialized Gradient Boosted Regression Trees, 2010, Yahoo! Learning to Rank Challenge.

[26] M. de Rijke, et al. Discovering missing links in Wikipedia, 2005, LinkKDD '05.