Assessing the Dependability of Apache Spark System: Streaming Analytics on Large-Scale Ocean Data

Real-world data from diverse domains require scalable, real-time analysis. Large-scale data processing frameworks such as Hadoop fall short when results are needed on the fly. Apache Spark’s streaming library is becoming an increasingly popular choice because it can stream and analyze large volumes of data. In this paper, we analyze large-scale geo-temporal data collected from the USGODAE (United States Global Ocean Data Assimilation Experiment) data catalog and use it to showcase and assess the dependability of Spark stream processing. We measure streaming latency and monitor scalability by adding and removing nodes in the middle of a streaming job. We also verify fault tolerance by stopping nodes in the middle of a job and confirming that the job is rescheduled and completed on other nodes. We design a full-stack application that automates data collection, data processing, and visualization of the results. We also use the Google Maps API to visualize results by color-coding the world map with values from various analytics.
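To make the latency measurement concrete, the following is a minimal sketch of timing micro-batches while aggregating geo-tagged readings per grid cell. It is not the paper's implementation and does not use the Spark API: it is a standard-library stand-in for a micro-batch loop. The function names, the 10-degree grid, the sample readings, and the choice of temperature as the measured value are all assumptions made for illustration.

```python
import time
from collections import defaultdict


def micro_batches(records, batch_size):
    """Split a record sequence into fixed-size micro-batches."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def grid_cell(lat, lon, cell_deg=10):
    """Map a coordinate to the south-west corner of its grid cell."""
    return (int(lat // cell_deg) * cell_deg, int(lon // cell_deg) * cell_deg)


def process_stream(records, batch_size=2):
    """Aggregate readings per grid cell and record per-batch latency."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    latencies = []
    for batch in micro_batches(records, batch_size):
        started = time.perf_counter()
        for lat, lon, value in batch:
            cell = grid_cell(lat, lon)
            sums[cell] += value
            counts[cell] += 1
        # Latency here is only the in-process time spent on one batch.
        latencies.append(time.perf_counter() - started)
    averages = {cell: sums[cell] / counts[cell] for cell in sums}
    return averages, latencies


# Hypothetical readings: (latitude, longitude, temperature in deg C).
readings = [
    (12.5, -45.2, 26.0),
    (14.1, -43.9, 28.0),
    (-33.0, 151.2, 19.0),
    (-31.5, 152.8, 21.0),
    (15.0, -41.0, 27.0),
]

averages, latencies = process_stream(readings, batch_size=2)
for cell, avg in sorted(averages.items()):
    print(cell, avg)
print("batches timed:", len(latencies))
```

The per-cell averages are the kind of value that could drive a color-coded map. In a real cluster, end-to-end latency would also include ingestion, scheduling, and network time, which this single-process sketch does not capture.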
