A Novel K-Means Clustering Algorithm Based on Positive Examples and Careful Seeding

Positive and unlabeled learning (PU learning) is a special case of semi-supervised learning in which the training set consists of only two parts: positive examples and unlabeled examples. Many real-world classification applications can be cast as PU learning problems. This paper describes a semi-supervised learning algorithm for this setting. Our approach extends K-means++, an enhancement to K-means that seeds the algorithm with suitably chosen cluster centers, to PU learning. Experiments on the Spam and 20-newsgroup data sets show that the proposed algorithm achieves better performance.
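The abstract does not spell out how the positive examples enter the seeding step, so the following is only a minimal sketch under stated assumptions, not the paper's algorithm. It implements the standard K-means++ seeding rule (each new center is drawn with probability proportional to its squared distance from the nearest center already chosen) and assumes, purely for illustration, that the first center is the mean of the positive examples. The names `sq_dist`, `seed_centers`, `points`, `positives`, and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centers(points, k, positives=(), seed=None):
    """Pick k initial centers with K-means++ style D^2 sampling.

    Assumption (not from the paper): if positive examples are given,
    their mean is used as the first center; otherwise the first center
    is a uniformly random point, as in plain K-means++.
    """
    rng = random.Random(seed)
    if positives:
        dim = len(positives[0])
        first = [sum(p[j] for p in positives) / len(positives) for j in range(dim)]
    else:
        first = list(rng.choice(points))
    centers = [first]

    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, first) for p in points]
    while len(centers) < k:
        if sum(d2) > 0:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2)[0]
        else:
            # Every point coincides with a center; fall back to uniform.
            idx = rng.randrange(len(points))
        centers.append(list(points[idx]))
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers
```

A call such as `seed_centers(points, 2, positives=points[:2], seed=0)` returns two centers: the positive mean, followed by a data point that is likely to lie far from it. The resulting centers would then be handed to an ordinary K-means iteration.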
