The 2-catalog segmentation problem

We study the 2-CATALOG SEGMENTATION problem: given a set $I$ of $n$ items and a family $S = \{S_1, S_2, \ldots, S_p\}$ of subsets of $I$, find $C_1, C_2 \subseteq I$ such that $|C_1|, |C_2| \le r$ and the sum $\sum_{i=1}^{p} \max\{|S_i \cap C_1|, |S_i \cap C_2|\}$ is maximized. The problem was recently introduced by Kleinberg et al. [3] and is motivated by several applications to data mining and clustering operations, as detailed in [3]. Under the restriction that $|S_i| = \Omega(|I|)$ for each $S_i$, the authors give a PTAS. In general, however, only a trivial 0.5-approximation algorithm is known: simply take $C_1$ to be the $r$ most frequently occurring elements. The question of improving upon this factor was posed as an important open problem in [3]. We make some progress towards answering this question, under the assumption that the size of the item set $I$ is at most $2r$, i.e., twice the catalog size. Our motivation comes from the following natural restriction of the 2-CATALOG SEGMENTATION problem: we are given $C_1 \cup C_2$, that is, the elements that comprise the optimal solution. How well can one approximate the optimal partition in this case? This variation appears to be a natural intermediate step towards solving the original problem, and we hope that understanding it will shed some light on the structure of the general problem. By relating this problem to the well-known MINIMUM BISECTION problem, we show that this restricted segmentation problem is NP-hard, even under the assumption that each $S_i$ contains at most 2 elements. Using a semidefinite programming relaxation similar to the one used by Frieze and Jerrum [1], we obtain a 0.565-approximation for this restriction of the problem. We obtain a ratio of 0.651 when the two catalogs are required to be disjoint; this variant is referred to as the DISJOINT 2-CATALOG SEGMENTATION problem. We also obtain some results for the general 2-CATALOG SEGMENTATION problem when the sets $S_i \in S$ are bounded in size by a small constant.
Finally, we obtain a 0.54-approximation for the DENSEST PARTITION problem defined in Section 4.
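
To make the objective and the trivial baseline concrete, the following is a minimal Python sketch, not the semidefinite programming algorithm of this paper. It evaluates the segmentation objective, builds the single catalog of the r most frequently occurring items, and compares it against an exhaustive search on a toy instance. The factor 0.5 holds because the optimum is at most the coverage of its first catalog plus the coverage of its second, and no single catalog of size r covers more than the r most frequent items do. All function names and the toy instance are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def segmentation_value(sets, c1, c2):
    """Objective: sum over S_i of max(|S_i & C_1|, |S_i & C_2|)."""
    return sum(max(len(s & c1), len(s & c2)) for s in sets)


def most_frequent_catalog(sets, r):
    """Trivial baseline: the r items that occur in the most sets."""
    counts = Counter(item for s in sets for item in s)
    # Sort by descending frequency; break ties by item for determinism.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return frozenset(ranked[:r])


def brute_force_optimum(items, sets, r):
    """Exhaustive search over pairs of size-r catalogs (tiny instances only)."""
    catalogs = [frozenset(c) for c in combinations(sorted(items), r)]
    return max(segmentation_value(sets, a, b) for a in catalogs for b in catalogs)


# Hypothetical toy instance: 4 items, 4 customer sets, catalog size r = 2.
items = {1, 2, 3, 4}
sets = [frozenset(s) for s in ({1, 2}, {1, 2}, {3, 4}, {1, 3})]
r = 2

c1 = most_frequent_catalog(sets, r)
baseline = segmentation_value(sets, c1, c1)  # second catalog adds nothing
optimum = brute_force_optimum(items, sets, r)
print(baseline, optimum)  # prints "5 7"
```

On this instance the baseline reaches 5 against an optimum of 7, which is consistent with the 0.5 guarantee. The exhaustive search enumerates all pairs of size-r catalogs and is only meant for checking small examples.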