On the Use of Consensus Clustering for Incremental Learning of Topic Hierarchies

Incremental learning of topic hierarchies helps organize and manage growing text collections by summarizing the knowledge implicit in textual data. However, currently available methods have limitations in the incremental learning phase. In particular, when the initial topic hierarchy does not model the data well, new documents are inserted into inappropriate topics, and this error propagates into later hierarchy updates, reducing the quality of the knowledge extraction process. We introduce a method that uses consensus clustering to obtain more robust initial topic hierarchies. Experimental results on several text collections show that our method significantly reduces the degradation of topic hierarchies during incremental learning compared to a traditional method.
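
The abstract does not say which consensus function the method uses. As a rough illustration only, the sketch below assumes one common choice: a co-association matrix (the fraction of base partitions in which two documents share a cluster) followed by average-link agglomerative merging to obtain a hierarchy. The function names `co_association` and `consensus_hierarchy` and the toy partitions are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of documents co-occurs."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_hierarchy(co):
    """Average-link agglomerative merging on co-association similarity.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    def avg_sim(a, b):
        return sum(co[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(co))]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average agreement.
        a, b = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        merges.append((a, b, avg_sim(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return merges


# Toy input (assumed): three base clusterings of four documents.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with different label names
    [0, 0, 0, 1],
]
co = co_association(partitions)
merges = consensus_hierarchy(co)
for a, b, sim in merges:
    print(a, b, round(sim, 3))
```

Documents that every base clustering groups together merge first, so a single unsuitable base partition has less influence on the upper levels of the initial hierarchy.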
