A probabilistic theory of clustering

Abstract

Data clustering is typically considered a subjective process, which makes it problematic. For instance, how does one make statistical inferences based on clustering? The matter is different with pattern classification, for which two fundamental characteristics can be stated: (1) the error of a classifier can be estimated using “test data,” and (2) a classifier can be learned using “training data.” This paper presents a probabilistic theory of clustering that includes both learning (training) and error estimation (testing). The theory is based on operators on random labeled point processes. It includes an error criterion in the context of random point sets and a representation of the Bayes (optimal) cluster operator for a given random labeled point process. Training is illustrated using a nearest-neighbor approach, and trained cluster operators are compared to several classical clustering algorithms.
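To make the idea of "testing" a cluster operator concrete, here is a minimal Python sketch of one way an empirical clustering error could be computed. The abstract does not give the paper's exact error criterion, so the definition used here is an assumption: the fraction of points assigned to the wrong cluster, minimized over relabelings of the clusters and averaged over a collection of labeled point sets. The names `clustering_error`, `estimate_error`, and `cluster_operator` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def clustering_error(true_labels, predicted_labels, num_clusters):
    """Fraction of mislabeled points, minimized over cluster relabelings.

    Assumed error measure (not taken from the paper): cluster indices are
    arbitrary, so every permutation of the predicted indices is tried and
    the best match with the true labels is kept.
    """
    n = len(true_labels)
    if n == 0 or n != len(predicted_labels):
        raise ValueError("label sequences must be non-empty and of equal length")
    best = n
    for perm in permutations(range(num_clusters)):
        mismatches = sum(
            1 for t, p in zip(true_labels, predicted_labels) if perm[p] != t
        )
        best = min(best, mismatches)
    return best / n


def estimate_error(cluster_operator, labeled_point_sets, num_clusters):
    """Average the per-set error over test realizations of labeled point sets.

    `cluster_operator` is a hypothetical callable mapping a list of points
    to a list of cluster indices in range(num_clusters).
    """
    errors = [
        clustering_error(labels, cluster_operator(points), num_clusters)
        for points, labels in labeled_point_sets
    ]
    return sum(errors) / len(errors)
```

The exhaustive search over permutations is only practical for a small number of clusters; it is used here because it keeps the sketch short and easy to check by hand.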
