Ranking a stream of news

According to a recent survey by Nielsen NetRatings, searching for news articles is one of the most important online activities. Indeed, Google, Yahoo, MSN and many others have launched commercial search engines for indexing news feeds. Despite this commercial interest, no academic research has focused on ranking a stream of news articles and a set of news sources. In this paper, we introduce this problem by proposing a ranking framework which models: (1) the process of generating a stream of news articles, (2) the clustering of news articles by topic, and (3) the evolution of a news story over time. The proposed algorithm ranks news information, finding the most authoritative news sources and identifying the most interesting events in the different categories to which news articles belong. All these ranking measures take time into account and can be obtained without a predefined sliding window of observation over the stream. The complexity of our algorithm is linear in the number of pieces of news still under consideration at the time of a new posting. This allows a continuous online ranking process. Our ranking framework is validated on a collection of more than 300,000 pieces of news, produced in two months by more than 2000 news sources belonging to 13 different categories (World, U.S., Europe, Sports, Business, etc.). This collection is extracted from the index of comeToMyHead, an academic news search engine available online.
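To make the "no sliding window" and "linear per posting" properties concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes a simple exponential decay of article freshness, and the names `DecayRanker`, `half_life`, and `floor` are hypothetical. Each new posting decays every article still under consideration once, so the cost per posting is linear in the number of active articles, and old articles fade out instead of being cut off by a fixed window.

```python
from dataclasses import dataclass, field


@dataclass
class DecayRanker:
    """Illustrative time-aware ranker (an assumption, not the paper's method)."""
    half_life: float = 24.0   # hours until an article's score halves (assumed)
    floor: float = 0.01       # articles scoring below this are dropped (assumed)
    last_time: float = 0.0    # timestamp of the most recent posting
    # article_id -> (source, current freshness score)
    articles: dict = field(default_factory=dict)

    def post(self, article_id: str, source: str, t: float) -> None:
        """Register a new article at time t (hours); O(number of active articles)."""
        factor = 0.5 ** ((t - self.last_time) / self.half_life)
        self.last_time = t
        survivors = {}
        for aid, (src, score) in self.articles.items():
            score *= factor
            if score >= self.floor:      # keep only articles still under consideration
                survivors[aid] = (src, score)
        survivors[article_id] = (source, 1.0)  # a fresh article starts at full score
        self.articles = survivors

    def source_rank(self) -> list:
        """Rank sources by the summed freshness of their active articles."""
        totals = {}
        for src, score in self.articles.values():
            totals[src] = totals.get(src, 0.0) + score
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranker = DecayRanker(half_life=24.0, floor=0.01)
ranker.post("a1", "SourceA", 0.0)
ranker.post("b1", "SourceB", 24.0)
ranker.post("b2", "SourceB", 24.0)
ranking = ranker.source_rank()
```

In this toy run, the article from `SourceA` is one half-life old when the two `SourceB` articles arrive, so `SourceB` ranks first. A real system along the lines of the abstract would also account for topic clusters and story evolution, which this sketch leaves out.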
