A major computational burden in document clustering is the calculation of the similarity measure between pairs of documents. A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents according to the degree of similarity between them. A value of zero means that the documents are completely dissimilar, whereas a value of one indicates that the documents are practically identical. Traditionally, vector-based models have been used to compute document similarity. These models represent several features present in documents but, in general, cannot account for the semantics of the document. Documents written in human languages contain contexts, and the words used to describe these contexts are generally semantically related. Motivated by this fact, many researchers have proposed semantic-based similarity measures that use text annotation through external thesauruses such as WordNet (a lexical database). In this paper, we define a semantic similarity measure based on documents represented as topic maps. Topic maps are rapidly becoming an industry standard for knowledge representation, with a focus on later search and extraction. The documents are transformed into topic-map-based coded knowledge, and the similarity between a pair of documents is represented as a correlation between their common patterns (sub-trees). Experimental studies on text mining datasets show that this new similarity measure is more effective than commonly used similarity measures in text clustering.
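To make the vector-based baseline concrete, the following is a minimal sketch of one commonly used vector-based measure, cosine similarity over raw term-frequency vectors. It is an illustration only and not the topic-map measure proposed in this paper; the function name `cosine_similarity` and the whitespace tokenization are assumptions chosen for brevity.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents using raw term frequencies.

    Assumption: tokens are lowercased, whitespace-separated words.
    With non-negative term counts the result lies between 0 and 1.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    # An empty document has no direction; treat it as dissimilar.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because this measure compares surface tokens only, `cosine_similarity("car", "automobile")` returns 0.0 even though the two words are near-synonyms, which is the kind of semantic gap that thesaurus-based and topic-map-based measures aim to close.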