Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

Real-world data from diverse domains require scalable analysis in real time. Large-scale data processing frameworks such as Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is becoming an increasingly popular choice because it can stream and analyze large volumes of data. In this paper, we analyze large-scale geo-temporal data collected from the USGODAE (United States Global Ocean Data Assimilation Experiment) data catalog to showcase and assess the capabilities of Spark stream processing. We measure streaming latency and monitor scalability by adding and removing nodes in the middle of a streaming job. We also verify fault tolerance by stopping nodes in the middle of a job and confirming that the job is rescheduled and completed on other nodes. We design a full-stack application that automates data collection, data processing, and visualization of the results. The application uses the Google Maps API to color-code the world map with values from various analytics.
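The pipeline described in the abstract (micro-batched processing of geo-temporal records, per-batch latency measurement, and color coding of map regions) can be illustrated with a minimal, standard-library-only sketch. This is not the paper's implementation and does not use Spark's API. The record layout `(timestamp, lat, lon, value)`, the one-degree grid cells, the four-color palette, and all function names are assumptions made for illustration only.

```python
import time
from collections import defaultdict
from statistics import mean


def micro_batches(records, interval):
    """Group (timestamp, lat, lon, value) records into fixed-length time windows."""
    batches = defaultdict(list)
    for ts, lat, lon, value in records:
        batches[int(ts // interval)].append((lat, lon, value))
    # Return the batches in time order.
    return [batches[k] for k in sorted(batches)]


def cell_means(batch, cell_deg=1.0):
    """Average the values that fall into each cell_deg x cell_deg grid cell."""
    cells = defaultdict(list)
    for lat, lon, value in batch:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append(value)
    return {key: mean(vals) for key, vals in cells.items()}


def color_for(value, lo, hi, palette=("blue", "green", "yellow", "red")):
    """Map a value in [lo, hi] to one of the palette buckets (hypothetical palette)."""
    if hi <= lo:
        return palette[0]
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return palette[min(int(frac * len(palette)), len(palette) - 1)]


def process_stream(records, interval, lo, hi):
    """Process each micro-batch and record how long the processing took."""
    results = []
    for batch in micro_batches(records, interval):
        start = time.perf_counter()
        colors = {cell: color_for(avg, lo, hi)
                  for cell, avg in cell_means(batch).items()}
        latency = time.perf_counter() - start  # seconds spent on this batch
        results.append((colors, latency))
    return results
```

In a real deployment the batching, scheduling, and recovery would be handled by the streaming engine across cluster nodes; the sketch only shows the shape of the per-batch computation and where a latency measurement could be taken.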
