Characterizing subsets of data is a recurring problem in data mining. We propose a keyword selection method for characterizing clusters of data whenever textual descriptions can be associated with the data. Several methods that cluster data sets or form projections of data also provide an ordering of, or a distance measure between, the clusters. If such an ordering exists or can be deduced, the method uses it to improve the characterizations. The proposed method may be applied, for example, to characterizing graphical displays of ordered collections of data (e.g., those produced with the SOM algorithm). The method is validated using a collection of 10,000 scientific abstracts from the INSPEC database organized on a WEBSOM document map.