IntoNews: Online news retrieval using closed captions

Abstract: We present IntoNews, a system that matches online news articles with spoken news from television newscasts, represented by their closed captions. We formalize the news matching problem as two independent tasks: closed caption segmentation and news retrieval. The system segments closed captions using a windowing scheme, either a sliding or a tumbling window. It then builds a query from each segment by extracting representative terms, and uses the query to retrieve previously indexed news articles from a search engine. To detect when a new article should be surfaced, the system compares the current set of retrieved articles with the set retrieved for the previous segment. The intuition is that if the difference between these sets is large enough, the topic of the newscast currently on air has likely changed and a new article should be displayed to the user. To evaluate IntoNews, we build a test collection using data from a second screen application and a major online news aggregator. The dataset is manually segmented and annotated by expert assessors, and serves as our ground truth. It is freely available for download through the Webscope program. Our evaluation is based on a set of novel time-relevance metrics that take into account three aspects of the problem: precision, timeliness, and coverage. We compare our algorithms against the best method previously proposed in the literature for this problem. Experiments show the trade-offs among precision, timeliness, and coverage of the airing news. Our best method is four times more accurate than the baseline.
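
The pipeline described in the abstract (windowing, query building, retrieval, and comparison of consecutive result sets) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: term frequency stands in for representative term extraction, a toy in-memory term-overlap ranking stands in for the search engine, Jaccard distance with a threshold of 0.5 stands in for the set-difference test, and all function names, captions, and article IDs are hypothetical.

```python
from collections import Counter

# Hypothetical stopword list, only for this toy example.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def windows(lines, size, step):
    """Yield caption segments of `size` lines.

    step == size gives a tumbling window; step < size gives a sliding one.
    """
    for start in range(0, max(len(lines) - size, 0) + 1, step):
        yield lines[start:start + size]


def build_query(segment, k=5):
    """Use the k most frequent non-stopword terms as the query (assumption)."""
    terms = [t for line in segment for t in line.lower().split()
             if t not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(k)]


def retrieve(query, index, n=2):
    """Toy stand-in for a search engine: rank articles by term overlap."""
    scored = [(-sum(t in words for t in query), doc_id)
              for doc_id, words in index.items()]
    return [doc_id for score, doc_id in sorted(scored)[:n] if score < 0]


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; defined as 0 when both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def surface_articles(lines, index, size=2, step=2, threshold=0.5):
    """Return the top article each time the result set changes enough."""
    previous, shown = set(), []
    for segment in windows(lines, size, step):
        ranked = retrieve(build_query(segment), index)
        current = set(ranked)
        if ranked and jaccard_distance(current, previous) > threshold:
            shown.append(ranked[0])  # surface the top-ranked article
        previous = current
    return shown


# Hypothetical captions and pre-indexed articles (article ID -> term set).
CAPTIONS = [
    "the senate vote is today",
    "election results for the senate",
    "a storm hits the coast",
    "flood warnings on the coast",
]
INDEX = {
    "a1": {"election", "vote", "senate"},
    "a2": {"storm", "flood", "coast"},
}

print(surface_articles(CAPTIONS, INDEX))
```

With `step` equal to `size` the segments do not overlap; a smaller `step` produces overlapping segments, so consecutive result sets tend to share more articles and the same threshold triggers less often. This is one way the choice of window and threshold can trade timeliness against precision in such a sketch.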
