Deterministic annealing, clustering, and optimization

This work introduces deterministic annealing (DA) as a useful approach to clustering and related optimization problems. The approach is strongly motivated by analogies to statistical physics, but is derived formally within information theory and probability theory. It enables escape from the local optima that plague traditional techniques, without the extremely slow schedules typically required by stochastic methods, and the clustering solutions it obtains are independent of the choice of initial configuration. A probabilistic framework is constructed, based on the principle of maximum entropy. The association probabilities at a given average distortion are Gibbs distributions parametrized by the corresponding Lagrange multiplier $\beta$, which is inversely proportional to the temperature in the physical analogy. By computing marginal probabilities, an effective cost is obtained whose minimization yields the most probable set of cluster representatives at a given temperature. This effective cost is the free energy of statistical mechanics, which is indeed the quantity optimized at isothermal, stochastic equilibrium.

Within the probabilistic framework, annealing is introduced by controlling the Lagrange multiplier $\beta$, and is interpreted as gradually reducing the "fuzziness" of the associations. Phase transitions are identified in the process; these are, in fact, cluster splits, and a sequence of them produces a hierarchy of fuzzy-clustering solutions. Critical values of $\beta$ are computed exactly for the first phase transition and approximately for subsequent ones. Specific algorithms addressing different aspects of clustering can be derived from the general approach. Experimental results indicate that DA is substantially superior to traditional techniques.

The approach is then extended to a larger family of optimization problems that can be reformulated as constrained clustering, and a probabilistic framework for constrained clustering is derived. Three examples are discussed. Mass-constrained clustering improves the clustering procedure by making it independent of the number and multiplicities of the representatives. The travelling salesman problem is reformulated as constrained clustering, yielding the elastic net approach; a second Lagrange multiplier is identified and controlled during the process. Finally, the approach is suggested for the self-organization of neural networks.
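To make the framework concrete, here is a minimal sketch of the central quantities, assuming the common squared-Euclidean distortion $d(x, y) = \|x - y\|^2$ (the choice of distortion measure is an assumption of this sketch, not fixed by the text above):

```latex
% Gibbs association probabilities between data point x and representative y_j
P(x \in C_j) = \frac{e^{-\beta\, d(x, y_j)}}{\sum_k e^{-\beta\, d(x, y_k)}}

% Effective cost obtained by computing marginal probabilities: the free energy
F = -\frac{1}{\beta} \sum_{x} \ln \sum_{j} e^{-\beta\, d(x, y_j)}

% Stationarity of F yields the most probable representatives:
% each y_j is the association-weighted centroid of the data
y_j = \frac{\sum_x x \, P(x \in C_j)}{\sum_x P(x \in C_j)}
```

As $\beta \to 0$ the associations become uniform (maximally fuzzy) and all representatives coincide at the data centroid; as $\beta \to \infty$ the associations harden and $F$ approaches the classical (hard) clustering distortion.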
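The annealing procedure then reduces to a simple alternation: at each temperature, iterate the two fixed-point conditions above to convergence, then increase $\beta$. The following is a minimal runnable sketch of such a loop; the geometric schedule, parameter names, and stopping criteria are illustrative choices, not prescribed by the text.

```python
import numpy as np

def deterministic_annealing(X, k, beta0=1e-4, beta_max=1e3, growth=1.1,
                            inner_iters=100, tol=1e-6, seed=0):
    """Sketch of a basic DA clustering loop (squared-Euclidean distortion).

    X: (n, d) data array; k: number of representatives.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start all representatives near the data mean: at high temperature
    # (small beta) the single-cluster solution is the global optimum, so
    # the outcome does not depend on this initialization.
    Y = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, d))
    beta = beta0
    while beta < beta_max:
        for _ in range(inner_iters):
            # Gibbs association probabilities P(j | x_i).
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)
            # Centroid condition: association-weighted means.
            Y_new = (P.T @ X) / P.sum(axis=0)[:, None]
            done = np.abs(Y_new - Y).max() < tol
            Y = Y_new
            if done:
                break
        beta *= growth  # anneal: lower the temperature
        # Tiny perturbation so coinciding representatives can separate
        # when a phase transition (cluster split) occurs.
        Y = Y + 1e-6 * rng.standard_normal(Y.shape)
    return Y
```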
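For the squared-Euclidean case, the first critical value admits a closed form: the single-cluster solution becomes unstable at $\beta_c = 1/(2\lambda_{\max})$, where $\lambda_{\max}$ is the largest eigenvalue of the data covariance matrix. A short sketch of that computation, again assuming this distortion measure:

```python
import numpy as np

def first_critical_beta(X):
    """First phase transition of squared-Euclidean DA clustering:
    beta_c = 1 / (2 * lambda_max), with lambda_max the largest
    eigenvalue of the covariance of the single high-temperature
    cluster, i.e. of the whole data set."""
    C = np.atleast_2d(np.cov(X, rowvar=False))  # (d, d) covariance
    lam_max = np.linalg.eigvalsh(C).max()       # largest eigenvalue
    return 1.0 / (2.0 * lam_max)
```

Since no structure emerges at higher temperatures, annealing may as well start just below $\beta_c$.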
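Mass-constrained clustering attaches a mass $p_j$ to each representative and weights the associations by it. A hedged sketch of one fixed-point update, with illustrative variable names:

```python
import numpy as np

def mass_constrained_step(X, Y, p, beta):
    """One fixed-point update for mass-constrained DA clustering.

    X: (n, d) data; Y: (k, d) representatives; p: (k,) positive masses.
    """
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    logits = -beta * d2 + np.log(p)[None, :]  # masses weight the Gibbs terms
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)           # P[i, j] = P(y_j | x_i)
    p_new = P.mean(axis=0)                      # updated masses
    Y_new = (P.T @ X) / P.sum(axis=0)[:, None]  # mass-weighted centroids
    return Y_new, p_new
```

Because $m$ coinciding representatives with masses summing to $p$ induce exactly the same associations as a single representative of mass $p$, the number and multiplicities of representatives no longer affect the solution.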
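For the travelling salesman reformulation, the representatives are constrained to lie on a closed tour, and the second Lagrange multiplier (written $\gamma$ below, a notational choice) weights the tour-length constraint. The following sketch shows one update step in the spirit of the resulting elastic net; the learning rate and parameter names are illustrative:

```python
import numpy as np

def elastic_net_step(cities, tour, beta, gamma, lr=0.1):
    """One elastic-net update for the TSP viewed as constrained clustering.

    cities: (n, 2) city coordinates; tour: (m, 2) points on a closed path.
    """
    d2 = ((cities[:, None, :] - tour[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)  # W[i, j] = P(tour_j | city_i)
    # Attraction of tour points toward the cities associated with them ...
    attract = W.T @ cities - W.sum(axis=0)[:, None] * tour
    # ... plus a discrete Laplacian along the closed tour penalizing length.
    smooth = np.roll(tour, -1, axis=0) - 2 * tour + np.roll(tour, 1, axis=0)
    return tour + lr * (attract + gamma * smooth)
```

Annealing $\beta$ upward while controlling $\gamma$ gradually commits each city to a single tour point, hardening the fuzzy assignment into a tour.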