Design and analysis of novel similarity measure for clustering and classification of high dimensional text documents

The main idea of this research is to design a similarity measure that can be used to find the similarity between any two text documents, and then to use that measure to perform clustering. The designed measure is analyzed to study its behavior in the best-case, average-case, and worst-case situations. The proposed measure overcomes the drawbacks of the Euclidean, Cosine, and Jaccard similarity measures. The measure is evaluated on the Reuters-21578 dataset. The results show that the proposed measure outperforms the other measures.
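For orientation, the three baseline measures named above can be sketched as follows. This is a minimal illustration only, assuming whitespace tokenization and raw term counts. The helper names (`term_counts`, `euclidean_distance`, `cosine_similarity`, `jaccard_similarity`) are hypothetical, and the sketch does not implement the proposed measure, which the abstract does not specify.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words term counts (assumed tokenization: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def euclidean_distance(a, b):
    """Euclidean distance between two term-count vectors (0 means identical counts)."""
    terms = set(a) | set(b)
    # Counter returns 0 for terms that are absent from a document.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity over the sets of distinct terms (counts are ignored)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty documents are identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    d1 = term_counts("apple banana apple")
    d2 = term_counts("banana cherry")
    print(euclidean_distance(d1, d2))
    print(cosine_similarity(d1, d2))
    print(jaccard_similarity(d1, d2))
```

Note that Euclidean distance is a dissimilarity (smaller means closer), while the cosine and Jaccard values are similarities. Jaccard as written here ignores term frequencies entirely.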
