Identifying Clusters from Positive Data

The present work studies clustering from an abstract point of view and investigates its properties in the framework of inductive inference. Any class $S$ considered is given by a hypothesis space, i.e., a numbering $A_0,A_1,\ldots$ of nonempty recursively enumerable (r.e.) subsets of ${\mathbb{N}}$ or ${\mathbb{Q}^k}$. A clustering task is a finite and nonempty set of r.e. indices of pairwise disjoint such sets. The class $S$ is said to be clusterable if there is an algorithm which, for every clustering task $I$, converges in the limit on any text for $\bigcup_{i \in I} A_i$ to a finite set $J$ of indices of pairwise disjoint clusters such that $\bigcup_{j \in J} A_j = \bigcup_{i \in I} A_i$. A class is called semiclusterable if there is such an algorithm which finds a $J$ with the last condition relaxed to $\bigcup_{j \in J} A_j \supseteq \bigcup_{i \in I} A_i$.

The relationship between natural topological properties and clusterability is investigated. Topological properties can provide sufficient or necessary conditions for clusterability, but they cannot characterize it. On the one hand, many interesting conditions make use of both the topological structure of the class and a well-chosen numbering. On the other hand, the clusterability of a class does not depend on which numbering of the class is used as a hypothesis space for the clusterer.

These ideas are demonstrated in the context of naturally defined geometric classes. Besides the text for the clustering task, clustering of many of these classes requires additional information: the class of convex hulls of finitely many points in a rational vector space can be clustered with the number of clusters as additional information. Interestingly, the class of polygons (together with their interiors) is clusterable if the number of clusters and the overall number of vertices of these clusters are given to the clusterer as additional information.
Intriguingly, this additional information is not sufficient for classes including figures with holes. While some classes are unclusterable due to their topological structure, others are only computationally intractable. An oracle might permit clustering all computationally intractable clustering tasks but fail on some classes which are topologically difficult. It is shown that an oracle $E$ permits clustering all computationally difficult classes iff $E \geq_T K \wedge E' \geq_T K''$. Furthermore, no 1-generic oracle below $K$ and no 2-generic oracle permits clustering any class which is not clusterable without an oracle.
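The identification-in-the-limit protocol behind clusterability can be illustrated with a deliberately simplified sketch. The hypothesis space below consists of explicit finite sets standing in for the r.e. sets $A_0,A_1,\ldots$ (an illustrative assumption; the paper's setting allows arbitrary r.e. sets, where membership is only semi-decidable). The clusterer reads a text, and after each datum conjectures the least finite set $J$ of indices of pairwise disjoint hypotheses whose union equals the data seen so far; on a text for a task over finite sets, the conjectures stabilize once every element has appeared. The names `conjecture` and `cluster_in_limit` are hypothetical, chosen for this sketch.

```python
from itertools import combinations

# Toy hypothesis space: finite, decidable sets stand in for the
# r.e. sets A_0, A_1, ... of the abstract (illustrative assumption).
A = [{1, 2}, {3}, {2, 3}, {4, 5}]

def conjecture(data, space):
    """Return the least finite set J of indices of pairwise disjoint
    hypotheses whose union equals the data seen so far, or None."""
    for size in range(1, len(space) + 1):
        for J in combinations(range(len(space)), size):
            sets = [space[j] for j in J]
            union = set().union(*sets)
            # Pairwise disjoint iff no element is counted twice.
            disjoint = len(union) == sum(len(s) for s in sets)
            if disjoint and union == data:
                return set(J)
    return None

def cluster_in_limit(text, space):
    """Process a text (a sequence of data points) and return the
    sequence of conjectures made after each datum."""
    seen, guesses = set(), []
    for x in text:
        seen.add(x)
        guesses.append(conjecture(seen, space))
    return guesses

# A text for the task I = {0, 3}, i.e. for A_0 ∪ A_3 = {1, 2, 4, 5}:
guesses = cluster_in_limit([1, 2, 4, 5, 1, 2], A)
print(guesses[-1])  # converges to {0, 3}
```

The brute-force search over index sets is what makes this only a sketch: with genuinely r.e. hypotheses, equality of the union with the observed data cannot be decided, which is precisely why convergence is required only in the limit and why the paper's topological and oracle-theoretic conditions become relevant.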
