Semi-supervised Probabilistic Distance Clustering and the Uncertainty of Classification

Semi-supervised clustering is an attempt to reconcile clustering (unsupervised learning) with classification (supervised learning, which uses prior information on the data). The two modes of data analysis are combined in a parameterized model whose parameter θ ∈ [0, 1] is the weight attributed to the prior information: θ = 0 corresponds to clustering, and θ = 1 to classification. The results (cluster centers, classification rule) depend on θ. Insensitivity to θ indicates that the prior information agrees with the intrinsic cluster structure and is therefore redundant. This explains why some data sets (such as the Wisconsin breast cancer data, Merz and Murphy, UCI repository of machine learning databases, University of California, Irvine, CA) give good results for all reasonable classification methods. The uncertainty of classification is represented here by the geometric mean of the membership probabilities, which is shown to be an entropic distance related to the Kullback–Leibler divergence.
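The roles of θ and of the geometric mean can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: membership probabilities are taken to be inversely proportional to the distances from the cluster centers, and the prior information is mixed in by a simple convex combination weighted by θ. The function names `membership_probabilities`, `blend_with_prior`, and `geometric_mean` are hypothetical, and the blend is an illustrative simplification, not necessarily the model used in the paper.

```python
import math


def membership_probabilities(distances):
    """Assumed rule: probability of each cluster is proportional to 1/distance."""
    if any(d == 0 for d in distances):
        # The point coincides with one or more centers: split all mass among them.
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def blend_with_prior(p_cluster, p_prior, theta):
    """Hypothetical convex blend: theta = 0 keeps the clustering
    probabilities, theta = 1 keeps the prior (label) probabilities."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [(1.0 - theta) * pc + theta * pp
            for pc, pp in zip(p_cluster, p_prior)]


def geometric_mean(probs):
    """Geometric mean of the membership probabilities, used here as the
    uncertainty measure: largest when the probabilities are equal,
    zero when one cluster is certain."""
    return math.prod(probs) ** (1.0 / len(probs))


# A point at distances 1 and 3 from two cluster centers.
p = membership_probabilities([1.0, 3.0])
prior = [1.0, 0.0]  # assumed prior label: first class
for theta in (0.0, 0.5, 1.0):
    blended = blend_with_prior(p, prior, theta)
    print(theta, blended, geometric_mean(blended))
```

In this sketch, a point whose clustering probabilities already agree with its prior label changes little as θ varies, which mirrors the insensitivity to θ described above.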
