Processing online news streams for large-scale semantic analysis

While the Internet has given us access to a vast number of online news articles originating from thousands of different sources, the human capacity to read these articles has stayed roughly constant. Usually, the publishing industry takes on the role of filtering this enormous amount of information and presenting it in an appropriate way to its subscribers. In this paper, we discuss the semantic analysis of such news streams by introducing a system that streams online news collected by the Europe Media Monitor into our proposed semantic news analysis system. We describe in detail the challenges that arise and the corresponding engineering solutions for processing incoming articles in near real time. To demonstrate the use of our system, the case studies show a) temporal analysis of entities, such as institutions or persons, and b) their co-occurrence in news articles.
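The two case-study analyses can be illustrated with a minimal sketch. It is not the system described in the paper: the input format (a stream of date and entity-list pairs), the sample articles, and the function name `analyse` are all assumptions made for illustration, and entity extraction is taken as already done upstream.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample stream: (ISO date, entities found in one article).
# These records are invented for illustration, not taken from the paper.
articles = [
    ("2009-05-01", ["European Commission", "Angela Merkel"]),
    ("2009-05-01", ["European Commission", "Gordon Brown", "Angela Merkel"]),
    ("2009-05-02", ["Gordon Brown"]),
]


def analyse(stream):
    """Consume articles one at a time and keep two running tallies."""
    daily = Counter()  # (date, entity) -> number of articles mentioning it
    pairs = Counter()  # (entity_a, entity_b) -> number of shared articles
    for date, entities in stream:
        # Deduplicate per article and sort so each pair has one canonical order.
        unique = sorted(set(entities))
        for entity in unique:
            daily[(date, entity)] += 1
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1
    return daily, pairs


daily, pairs = analyse(articles)
```

Because both tallies are updated per article, the same loop could in principle run over an unbounded stream; a real deployment would additionally need windowing or persistence, which this sketch omits.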
