Visualization and Adaptive Subsetting of Earth Science Data in HDFS: A Novel Data Analysis Strategy with Hadoop and Spark

Data analytics is becoming increasingly important in big data applications. Adaptively subsetting large amounts of data to extract interesting events, such as the centers of hurricanes or thunderstorms, and then statistically analyzing and visualizing the subset, is an effective way to analyze ever-growing data. This is particularly crucial for Earth Science data, such as records of extreme weather. The Hadoop ecosystem (e.g., HDFS, MapReduce, Hive) provides a cost-efficient big data management environment and is being explored for analyzing big Earth Science data. Our study investigates the potential of a MapReduce-like paradigm to perform statistical calculations, and uses the calculated results to subset and visualize data in a scalable and efficient way. RHadoop and SparkR are deployed to enable R to access and process data in parallel with Hadoop and Spark, respectively. Regular R libraries and tools are used to create and manipulate images. Statistical calculations, such as maximum and average variable values, are carried out with R or SQL. We have developed a strategy to conduct query and visualization within one phase, which significantly improves overall performance in a scalable way. The technical challenges and limitations of both the Hadoop and Spark platforms for R are also discussed.
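As a rough illustration of the MapReduce-like pattern described above (compute partial statistics per partition, combine them into a global maximum and average, then use the result to subset the data), a minimal Python sketch follows. It is not the study's implementation, which uses R with RHadoop and SparkR. The record layout, the sample values, and the function names `map_partition`, `combine`, `global_stats`, and `subset_above` are all hypothetical, and using the global mean as the subsetting threshold is an assumption chosen only for the example.

```python
from functools import reduce

# Hypothetical records: (event_id, latitude, longitude, wind_speed).
# The values are made up for illustration only.
records = [
    ("A", 25.0, -80.0, 35.0),
    ("A", 25.5, -80.5, 60.0),
    ("B", 30.0, -90.0, 20.0),
    ("A", 26.0, -81.0, 45.0),
    ("B", 30.5, -90.5, 28.0),
]

def map_partition(partition):
    """Map step: summarize one non-empty partition as (max, sum, count)."""
    values = [record[3] for record in partition]
    return (max(values), sum(values), len(values))

def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    return (max(a[0], b[0]), a[1] + b[1], a[2] + b[2])

def global_stats(partitions):
    """Return the global (maximum, mean) of wind_speed over all partitions."""
    maximum, total, count = reduce(combine, map(map_partition, partitions))
    return maximum, total / count

def subset_above(partitions, threshold):
    """Keep only the records whose wind_speed exceeds the threshold."""
    return [r for partition in partitions for r in partition if r[3] > threshold]

# Stand-in for data blocks that would be distributed across a cluster.
partitions = [records[:2], records[2:]]

max_wind, mean_wind = global_stats(partitions)
subset = subset_above(partitions, mean_wind)
```

Because each partition is reduced to a small (max, sum, count) tuple before anything is combined, only those tuples need to move between workers, which is the property that lets this kind of calculation scale. The resulting `subset` would then be handed to a plotting step.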
