CS369N: Beyond Worst-Case Analysis Lecture #3: Deterministic Planted Models for Clustering and Graph Partitioning ∗

Last lecture motivated the need for a model of data to gain insight into the relative merits of different online paging algorithms. In this lecture and the next, we explore models of data for clustering and graph partitioning problems. We cover deterministic data models in this lecture, and probabilistic ones in the next.

In some optimization problems, the objective function can be taken quite literally. If one wants to maximize profit or accomplish some goal at minimum cost, then the goal translates directly into a numerical objective function. In other applications, an objective function is only a means to an end. Consider, for example, the problem of clustering. Given a set of data points, the goal is to cluster them into “coherent groups,” with points in the same group being “similar” and points in different groups being “dissimilar.” There is no obvious unique way to translate this goal into a numerical objective function, and as a result many different objective functions have been studied (k-means, k-median, k-center, etc.), each with the intent of turning the fuzzy notion of a “good/meaningful clustering” into a concrete optimization problem. In this case, we do not care about an “optimal solution” per se; rather, we want to uncover interesting structure in the data. We are therefore perfectly happy to compute a “meaningful clustering” with suboptimal objective function value, and highly dissatisfied with an “optimal solution” that fails to indicate any patterns in the data (which suggests that we were asking the wrong question, or expecting structure where none exists). The point is that if we are trying to cluster a data set, then we are implicitly assuming