Streaming Machine Learning Algorithms with Big Data Systems

Designing low-latency applications that can process large volumes of data efficiently is a challenging problem. Because the time available to process each data item is limited, online algorithms are becoming increasingly important in big-data applications. Stream processing is a well-known area that has been studied for a long time. In this research, our objective is to implement online algorithms on state-of-the-art big-data analytics engines and to compare the strengths and weaknesses of each system. We use streaming versions of Support Vector Machines (SVM) and KMeans for the analysis, and implement both algorithms on the Apache Flink, Apache Storm, and Twister2 streaming frameworks. Our study focuses on the efficiency of online training, and the results show that Twister2 achieves higher performance than the other two frameworks for these algorithms.
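To make the notion of online training concrete, the following is a minimal, framework-independent sketch of per-record updates for the two algorithms named above. It is not the implementation evaluated in this work. It assumes a sequential KMeans update with a 1/count learning rate and a stochastic subgradient step on the L2-regularized hinge loss for a linear SVM. The function names and the parameters `lr` and `lam` are hypothetical and chosen only for illustration.

```python
def online_kmeans_update(centroids, counts, point):
    """Assign `point` to its nearest centroid and move that centroid toward it.

    Assumption: sequential KMeans with a per-centroid learning rate of 1/count.
    `centroids` and `counts` are updated in place; the chosen index is returned.
    """
    # Nearest centroid by squared Euclidean distance.
    best = min(
        range(len(centroids)),
        key=lambda j: sum((c - x) ** 2 for c, x in zip(centroids[j], point)),
    )
    counts[best] += 1
    rate = 1.0 / counts[best]
    centroids[best] = [c + rate * (x - c) for c, x in zip(centroids[best], point)]
    return best


def online_svm_update(w, b, x, y, lr=0.1, lam=0.01):
    """One stochastic subgradient step for a linear SVM; `y` must be -1 or +1.

    Assumption: objective is lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)).
    Returns the updated (w, b) without modifying the inputs.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if margin < 1:
        # Hinge loss is active: regularization term plus loss subgradient.
        w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
        b = b + lr * y
    else:
        # Only the regularization term contributes.
        w = [wi - lr * lam * wi for wi in w]
    return w, b
```

In a streaming engine, an update of this kind would typically run inside an operator that holds the model as local state and applies one step per arriving record. How that state is partitioned and synchronized across parallel tasks depends on the framework and is not shown here.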
