A Cloud-Based Parallel Space-Saving Algorithm for Big Networking Data

As networks continue to evolve, analyzing traffic in full requires prohibitive resources. When processing very large data streams, the k most significant items (Top-k) are usually of greatest interest, and streaming algorithms are used because both memory and per-item processing time are limited. Space-Saving is one of the most popular algorithms for computing frequent and Top-k elements in data streams. In this paper, the algorithm is implemented in the cloud to analyze big networking data, and an empirical formula for the number of counters is derived so that the Top-k items can be maintained efficiently. In addition, an easily understandable proof of the mergeability of the Space-Saving algorithm is presented, and experiments are conducted to confirm the algorithm's effectiveness.
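To make the idea concrete, the following is a minimal Python sketch of the standard Space-Saving update rule together with one simple way to merge two summaries built on separate partitions of a stream. It is an illustration only, not the paper's cloud implementation: the function names (`space_saving`, `unmonitored_bound`, `merge`), the counter budget `m`, and the merge strategy (credit an item missing from one summary with that summary's minimum count, then keep the `m` largest counters) are assumptions chosen for this sketch, and the paper's empirical formula for the number of counters is not reproduced here.

```python
def space_saving(stream, m):
    """Summarize `stream` using at most `m` counters (Space-Saving update rule)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # Evict the item with the smallest count; the new item
            # takes over that counter and inherits min + 1.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def unmonitored_bound(summary, m):
    """Upper bound on the true count of any item the summary does not hold."""
    # A summary that never filled up has seen every item exactly,
    # so an absent item has true count 0.
    return min(summary.values()) if len(summary) >= m else 0


def merge(a, b, m):
    """Merge two summaries (assumed strategy, see lead-in) into one of size <= m."""
    bound_a = unmonitored_bound(a, m)
    bound_b = unmonitored_bound(b, m)
    combined = {
        item: a.get(item, bound_a) + b.get(item, bound_b)
        for item in set(a) | set(b)
    }
    # Keep the m largest counters; ties are broken by item name for determinism.
    largest = sorted(combined.items(), key=lambda kv: (-kv[1], str(kv[0])))[:m]
    return dict(largest)


# Two partitions of one stream, summarized independently and then merged.
part1 = ["a"] * 5 + ["b"] * 3 + ["c"]
part2 = ["a"] * 4 + ["d"] * 2 + ["b"]
merged = merge(space_saving(part1, 3), space_saving(part2, 3), 3)
```

In this sketch every counter in the merged summary is an overestimate of the item's true frequency, never an underestimate, which is what allows per-partition summaries to be combined in a parallel setting.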
