Coresets for Vector Summarization with Applications to Network Graphs

We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\REAL^d$ by a weighted mean $\tilde{p}$ of a \emph{subset} of $O(1/\eps)$ vectors, i.e., a subset whose size is independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\eps$ multiplied by the variance of $P$. We use this algorithm to maintain an approximate sum of vectors from an unbounded stream, using memory that is independent of $d$ and logarithmic in the number $n$ of vectors seen so far. Our main application is to extract and compactly represent friend groups and activity summaries of users from underlying data exchanges. For example, in mobile networks we can use GPS traces to identify meetings, and in social networks we can use information exchange to identify friend groups. Our algorithm provably identifies the \emph{Heavy Hitter} entries in a proximity (adjacency) matrix, which can in turn be used to extract these friend groups and activity summaries. We evaluate the algorithm on several large data sets.
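
To make the idea of a sparse weighted mean concrete, the following is a minimal sketch that uses the classical Frank-Wolfe iteration to approximate the mean of the rows of a matrix by a convex combination of at most $\lceil 1/\eps \rceil$ rows. It is an illustration under stated assumptions, not necessarily the algorithm of the paper: the function name `sparse_mean`, the step size $2/(t+2)$, and the iteration count are choices made here. The standard Frank-Wolfe analysis bounds the squared error by $4D^2/(k+2)$ after $k$ steps, where $D$ is the diameter of $P$; the variance-relative bound stated above would need additional treatment that this sketch does not attempt.

```python
import numpy as np

def sparse_mean(P, eps):
    """Hypothetical helper: approximate the mean of the rows of P by a
    convex combination of at most ceil(1/eps) rows, via Frank-Wolfe.

    Returns (w, approx), where w is a non-negative weight vector summing
    to 1 and approx equals w @ P.
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    mean = P.mean(axis=0)
    k = max(1, int(np.ceil(1.0 / eps)))   # number of Frank-Wolfe steps (assumption)

    w = np.zeros(n)
    w[0] = 1.0                            # start at an arbitrary input vector
    approx = P[0].copy()

    for t in range(k):
        # Gradient of f(x) = 0.5 * ||x - mean||^2 at the current point.
        grad = approx - mean
        # Linear minimization over the input vectors: most negative inner product.
        j = int(np.argmin(P @ grad))
        gamma = 2.0 / (t + 2)             # standard step size; gamma = 1 at t = 0
        w *= (1.0 - gamma)
        w[j] += gamma
        approx = (1.0 - gamma) * approx + gamma * P[j]

    return w, approx
```

Each step adds at most one new vector to the support, and the first step (with step size 1) discards the arbitrary starting vector, so the number of nonzero weights never exceeds the number of steps, regardless of $n$ and $d$.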
