Finding Topics in News Web Pages by Parameter-free Clustering

Abstract

Topic detection is a novel technology which structures news stories into several topics. Present topic detection approaches are mainly based on clustering algorithms such as single pass or agglomerative clustering, and all of these algorithms need at least one input parameter. We propose a novel clustering algorithm which automatically determines the parameters for each corpus. Experimental results show that the derived parameters are close to optimal, and that our algorithm achieves accuracy similar to that of the UPGMA algorithm when the latter is manually set with optimal parameters. Another advantage of our algorithm is that it runs much faster than the UPGMA algorithm.

Keywords: Topic Detection, Clustering Algorithm, Parameter Free, Similarity Distribution

1. Introduction

Nowadays web pages have become the fastest way for us to obtain news and publish individual opinions. It is hard, however, to identify exactly which pages we want. Given a set of keywords, search engines such as Google and Yahoo! return a very long list of URLs referring to Web pages, but it is still difficult to grasp and summarize the contents of the search results quickly. We need a convenient and efficient way to learn “what has happened” or “what is hot” from a large number of news web pages. Topic detection is such a method: it automatically finds topics in a group of news corpuses. At present, topic detection is widely applied to web information organization, such as the automatic construction of online news issues [1] or the organization of RSS news from different sources by topic [2].

Data mining algorithms are a natural choice for finding topics, and clustering is in fact the core algorithm of present topic detection approaches. Common clustering algorithms used for topic detection include single pass, kNN and agglomerative clustering [1][3][4][5]. A drawback of these algorithms is that each needs at least one input parameter, so they cannot run automatically without manual intervention.
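To make the parameter dependence concrete, the following is a minimal sketch of single-pass clustering, one of the algorithms named above. It is an illustration under stated assumptions rather than the implementation used in any of the cited systems: the whitespace tokenizer, the raw term-count vectors, the summed-count centroids, the function names `cosine` and `single_pass`, and the four toy headlines are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs: list[str], threshold: float) -> list[list[int]]:
    """Assign each document to its most similar existing cluster, or
    open a new cluster when no similarity reaches `threshold`."""
    centroids: list[Counter] = []   # summed term counts per cluster (assumed representation)
    clusters: list[list[int]] = []  # document indices per cluster
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())  # naive tokenizer, an assumption
        best, best_sim = -1, 0.0
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            clusters[best].append(i)
            centroids[best].update(vec)  # fold the document into the centroid
        else:
            clusters.append([i])
            centroids.append(vec)
    return clusters


# Toy headlines, invented for illustration only.
docs = [
    "earthquake hits coastal city",
    "coastal city earthquake damage report",
    "election results announced today",
    "election results disputed",
]
loose = single_pass(docs, threshold=0.3)   # [[0, 1], [2, 3]]
strict = single_pass(docs, threshold=0.9)  # [[0], [1], [2], [3]]
```

On this toy input the same stories form two topics at a threshold of 0.3 and four singleton topics at 0.9, so the result depends entirely on a value the user has to supply.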
A common way around this need for manual parameter setting is to use fixed parameters for all corpuses when they are similar in size and structure. A representative work taking this approach is [1], which uses the UPGMA agglomerative clustering algorithm and specifies a fixed threshold to control when the algorithm terminates. According to our experiments, however, the optimal thresholds vary widely when size or structure varies among corpuses. We suppose that different corpuses have different optimal parameters regardless of which algorithm and parameters are used, and it is impossible for users to determine the optimal

[1] Max Chevalier, et al. Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, 2008, Information Retrieval.

[2] Philip S. Yu, et al. Parameter Free Bursty Events Detection in Text Streams, 2005, VLDB.

[3] Eugenio Cesario, et al. Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data, 2007, IEEE Transactions on Knowledge and Data Engineering.

[4] Qi He, et al. Using Burstiness to Improve Clustering of Topics in News Streams, 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[5] Min Zhang, et al. Automatic online news issue construction in web environment, 2008, WWW.

[6] James Allan, et al. UMass at TDT 2004, 2004.

[7] Najaf Ali Shah, et al. Topic-based clustering of news articles, 2004, ACM-SE 42.

[8] Yiming Yang, et al. Learning approaches for detecting and tracking news events, 1999, IEEE Intell. Syst.

[9] George Karypis, et al. Hierarchical Clustering Algorithms for Document Datasets, 2005, Data Mining and Knowledge Discovery.

[10] Clifford Stein, et al. Clustering Data without Prior Knowledge, 2000, WAE.

[11] Wei-Ying Ma, et al. Multitype Features Coselection for Web Document Clustering, 2006, IEEE Trans. Knowl. Data Eng.

[12] Christian Böhm, et al. RIC: Parameter-free noise-robust clustering, 2007, TKDD.

[13] Hector Garcia-Molina, et al. Overview of multidatabase transaction management, 2005, The VLDB Journal.

[14] Yiming Yang, et al. A study of retrospective and on-line event detection, 1998, SIGIR '98.