
Topic discovery based on text mining techniques

In this paper, we present a topic discovery system aimed at revealing the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents related to it and a summary extracted from those documents. The summaries are useful for browsing the generated hierarchies and selecting topics of interest. Our proposal consists of a new incremental hierarchical clustering algorithm that combines partitional and agglomerative approaches and takes the main benefits of each. Finally, a new summarization method based on Testor Theory is proposed to build the topic summaries. Experimental results on the TDT2 collection demonstrate the system's usefulness and effectiveness not only as a topic detection system, but also as a classification and summarization tool.
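The abstract only outlines the approach, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a plain single-pass incremental scheme with bag-of-words vectors and cosine similarity; the names `Topic`, `add_document`, and the `threshold` value are hypothetical, and the `summary` field is left as a placeholder because the Testor Theory summarization step is not described here.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class Topic:
    """One node of a topic hierarchy: its documents, a summary, and subtopics."""
    doc_ids: list = field(default_factory=list)
    centroid: Counter = field(default_factory=Counter)  # summed term counts
    summary: str = ""  # placeholder; not computed in this sketch
    subtopics: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def add_document(topics, doc_id, text, threshold=0.3):
    """Assign an incoming document to the most similar topic, or open a new one.

    Assumption: a fixed similarity threshold decides when a new topic starts.
    """
    vec = Counter(text.lower().split())
    best = max(topics, key=lambda t: cosine(vec, t.centroid), default=None)
    if best is None or cosine(vec, best.centroid) < threshold:
        best = Topic()
        topics.append(best)
    best.doc_ids.append(doc_id)
    best.centroid.update(vec)  # fold the document into the topic centroid
    return best
```

Documents arriving one at a time either join an existing topic or start a new one, which is the "incremental" part; a hierarchical variant could apply the same step recursively to fill each topic's `subtopics`.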

Rafael Berlanga Llavori | José Ruiz-Shulcloper | Aurora Pons-Porrata
