Paper information - Scalable, continuous tracking of tag co-occurrences between short sets using (almost) disjoint tag partitions

In this work, we consider the continuous computation of set correlations over a stream of set-valued attributes, such as tweets and their hashtags, social annotations of blog posts obtained through RSS, or updates to set-valued attributes in databases. To compute tag correlations in a distributed fashion, all information needed for a computation must be present at the node(s) performing it. Our approach uses a partitioning scheme based on set covers to keep the information flow efficient and replication-lean. We report the results of a preliminary performance evaluation using tweets obtained through Twitter's streaming API.
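The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out. It is a standard greedy set-cover heuristic, shown under the assumption that each computing node owns one tag partition and that an incoming tag set is forwarded to as few nodes as possible. The names `greedy_cover`, `count_pairs`, and the node ids `n1` to `n3` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def greedy_cover(tag_set, partitions):
    """Greedily pick partitions until every tag in tag_set is covered.

    partitions maps a (hypothetical) node id to the set of tags it owns.
    Returns the chosen node ids in the order they were picked. Tags owned
    by no partition simply remain uncovered.
    """
    uncovered = set(tag_set)
    chosen = []
    while uncovered and partitions:
        # Pick the partition that covers the most still-uncovered tags.
        best = max(partitions, key=lambda p: len(partitions[p] & uncovered))
        gain = partitions[best] & uncovered
        if not gain:
            break  # the remaining tags belong to no partition
        chosen.append(best)
        uncovered -= gain
    return chosen


def count_pairs(stream):
    """Count how often each unordered tag pair co-occurs in a stream of tag sets."""
    counts = Counter()
    for tags in stream:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


# Assumed example data: partitions n1 and n2 overlap on tag "c".
partitions = {"n1": {"a", "b", "c"}, "n2": {"c", "d"}, "n3": {"e"}}
targets = greedy_cover({"a", "b", "d"}, partitions)
pair_counts = count_pairs([["a", "b"], ["b", "a", "c"]])
```

Fewer chosen partitions means fewer copies of each tag set on the network, which is one way to read "replication-lean". How the partitions themselves are built, and how pairs that span two partitions are handled, is left open here because the abstract does not specify it.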

Sebastian Michel | Foteini Alvanaki
