Text clustering based on good aggregations

Text clustering typically operates in a high-dimensional space, which makes it difficult in virtually all practical settings. In addition, given a particular clustering result, it is typically very hard to explain why the text clusters have been constructed the way they are. We propose a new approach that applies background knowledge, in the form of an ontology, during preprocessing in order to improve clustering results and to allow for selection between results. The results can be distinguished and explained by the corresponding selection of concepts in the ontology. Our results compare favourably with a sophisticated baseline preprocessing strategy.
