T-Sample: A Dual Reservoir-Based Sampling Method for Characterizing Large Graph Streams

Reservoir sampling is widely used to characterize the connectivity of large graph streams by producing edge samples. However, existing reservoir-based sampling methods mainly characterize graph streams by counting triangles. Because they produce disconnected edge samples, they are inaccurate when used to analyze the topological characteristics reflected by node degrees, which makes them ineffective in the many applications that require both types of connectivity estimation simultaneously and in real time. This paper proposes a new method, triangle-induced reservoir sampling (T-Sample), that produces connected edge samples. T-Sample still processes every edge in a graph stream only once, but it uses a carefully designed dual sampling mechanism that performs both uniform and non-uniform sampling with a base reservoir and an incremental reservoir. Specifically, the uniform samples can be used to count triangles with existing algorithms, while the non-uniform sampling ensures that the edge samples are connected. Experiments on real datasets show that T-Sample estimates node degree distributions much more accurately than existing reservoir-based sampling methods.
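
To make the dual-reservoir idea concrete, the following is a minimal single-pass sketch in Python. It is not the T-Sample algorithm itself, since the abstract does not specify its admission rule. The base reservoir uses the classic uniform reservoir sampling scheme (Algorithm R). The incremental reservoir uses a purely hypothetical rule, assumed here for illustration only: an edge that is not kept uniformly is admitted if it shares an endpoint with an edge currently in the base reservoir. The names `DualReservoir`, `base_size`, `extra_size`, and `_touches_base` are likewise assumptions.

```python
import random
from collections import Counter


class DualReservoir:
    """Illustrative dual-reservoir edge sampler (NOT the paper's exact method)."""

    def __init__(self, base_size, extra_size, seed=None):
        self.base_size = base_size    # capacity of the uniform (base) reservoir
        self.extra_size = extra_size  # capacity of the incremental reservoir
        self.base = []                # uniformly sampled edges
        self.extra = []               # non-uniformly sampled edges
        self.seen = 0                 # number of stream edges processed so far
        self.rng = random.Random(seed)

    def _touches_base(self, u, v):
        # ASSUMPTION: hypothetical connectivity test, not taken from the paper.
        return any(u in edge or v in edge for edge in self.base)

    def process(self, u, v):
        """Process one streamed edge exactly once."""
        self.seen += 1
        edge = (u, v)

        # Fill the base reservoir with the first base_size edges.
        if len(self.base) < self.base_size:
            self.base.append(edge)
            return

        # Algorithm R: the new edge replaces a random slot with
        # probability base_size / seen, keeping the base sample uniform.
        j = self.rng.randrange(self.seen)
        if j < self.base_size:
            self.base[j] = edge
            return

        # ASSUMED non-uniform rule: keep a rejected edge if it is adjacent
        # to the current base sample and the incremental reservoir has room.
        if len(self.extra) < self.extra_size and self._touches_base(u, v):
            self.extra.append(edge)

    def degrees(self):
        """Node degrees within the combined edge sample."""
        deg = Counter()
        for u, v in self.base + self.extra:
            deg[u] += 1
            deg[v] += 1
        return deg


# Example usage on a small synthetic path-shaped stream.
sampler = DualReservoir(base_size=5, extra_size=3, seed=0)
for i in range(100):
    sampler.process(i, i + 1)
```

In this sketch the base reservoir alone remains a uniform edge sample, so it could be passed to an existing triangle-counting estimator, while the incremental reservoir only adds edges adjacent to it. The sketch does not revisit the incremental reservoir when a base edge is evicted, so connectivity is checked only at admission time.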