Topic discovery based on text mining techniques

In this paper, we present a topic discovery system that aims to reveal the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents related to it and a summary extracted from those documents. These summaries help users browse the generated hierarchies and select topics of interest. Our proposal consists of a new incremental hierarchical clustering algorithm that combines partitional and agglomerative approaches, taking the main benefits of each. We also propose a new summarization method based on Testor Theory to build the topic summaries. Experimental results on the TDT2 collection demonstrate the system's usefulness and effectiveness not only for topic detection, but also as a classification and summarization tool.
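The output structure described above (a hierarchy in which each topic holds its related documents, an extracted summary, and its subtopics) might be represented roughly as follows. This is only an illustrative sketch and not the authors' implementation: the names `Topic`, `all_documents`, and `browse` are hypothetical, and neither the clustering algorithm nor the Testor-based summarization is shown.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple


@dataclass
class Topic:
    """One node of the hierarchy (hypothetical layout, assumed for illustration)."""
    label: str
    documents: Set[str] = field(default_factory=set)   # ids of documents assigned to this topic
    summary: str = ""                                   # summary extracted from those documents
    subtopics: List["Topic"] = field(default_factory=list)

    def all_documents(self) -> Set[str]:
        """Return the documents of this topic together with those of all its subtopics."""
        docs = set(self.documents)
        for sub in self.subtopics:
            docs |= sub.all_documents()
        return docs


def browse(topic: Topic, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Walk the hierarchy top-down, yielding (depth, label, summary) for each topic."""
    yield depth, topic.label, topic.summary
    for sub in topic.subtopics:
        yield from browse(sub, depth + 1)
```

Walking the hierarchy with `browse` mirrors the use case the abstract mentions: a reader scans the summaries level by level and descends only into the topics of interest.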
