Scaling Analysis of Affinity Propagation

We analyze and exploit some scaling properties of the affinity propagation (AP) clustering algorithm proposed by Frey and Dueck [Science 315, 972 (2007)]. Following a divide-and-conquer strategy, we set up an exact renormalization-based approach to address the question of clustering consistency, in particular how many clusters are present in a given data set. We first observe that the divide-and-conquer strategy, applied hierarchically to a large data set, reduces the complexity from $O(N^2)$ to $O(N^{(h+2)/(h+1)})$ for a data set of size $N$ and a hierarchy of depth $h$. For a data set embedded in a $d$-dimensional space, we show that this reduction is obtained without notably damaging the precision, except in dimension $d=2$. In fact, for $d$ larger than 2 the relative loss in precision scales like $N^{(2-d)/((h+1)d)}$. Finally, under some conditions we observe that there is a value $s^*$ of the penalty coefficient, a free parameter used to fix the number of clusters, which separates a fragmentation phase (for $s<s^*$) from a coalescent one (for $s>s^*$) of the underlying hidden cluster structure. At this precise point a self-similarity property holds, which the hierarchical strategy can exploit to locate $s^*$ as the result of an exact decimation procedure. From this observation, a strategy based on AP can be defined to find out how many clusters are present in a given data set.
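
To make the two scaling laws concrete, the short sketch below evaluates the exponents quoted in the abstract for a few values of $h$ and $d$. The function names are hypothetical helpers introduced here for illustration; they only compute the stated exponents and do not implement AP or the hierarchical procedure itself.

```python
from fractions import Fraction


def complexity_exponent(h: int) -> Fraction:
    """Exponent a in O(N^a) for a hierarchy of depth h (illustrative helper).

    Uses the expression quoted in the abstract: a = (h + 2) / (h + 1).
    h = 0 corresponds to plain AP with quadratic cost.
    """
    if h < 0:
        raise ValueError("depth h must be non-negative")
    return Fraction(h + 2, h + 1)


def precision_loss_exponent(d: int, h: int) -> Fraction:
    """Exponent b in the relative precision loss N^b (illustrative helper).

    Uses the expression quoted in the abstract: b = (2 - d) / ((h + 1) d).
    """
    if d < 1 or h < 0:
        raise ValueError("need d >= 1 and h >= 0")
    return Fraction(2 - d, (h + 1) * d)


if __name__ == "__main__":
    for h in range(4):
        print(f"h={h}: cost exponent {complexity_exponent(h)}")
    for d in (2, 3, 10):
        print(f"d={d}, h=1: loss exponent {precision_loss_exponent(d, 1)}")
```

Under these formulas the cost exponent falls from 2 at $h=0$ toward 1 as $h$ grows, while the loss exponent is negative for $d>2$ (the relative loss shrinks with $N$) and vanishes at $d=2$, matching the exception noted above.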
