Performance Factor Analysis and Scope of Optimization for Big Data Processing on Cluster

The use of computational clusters for large-scale Big Data processing has attracted attention as a technology trend because of its time efficiency. A modern cluster, equipped with the latest multi-core and many-core distributed shared architecture, a high-speed interconnect, and a high-speed file system, delivers high performance through message-passing and multi-threading parallel approaches, and it handles batch, micro-batch, and stream processing of massive high-dimensional datasets. However, running a data-intensive Big Data application on a compute-centric cluster challenges its performance because of several runtime overheads. To alleviate these bottlenecks and exploit the full potential of the cluster, this paper presents a state-of-the-practice, performance-oriented technical analysis covering all relevant aspects in the context of terascale Big Data processing on the TeraFLOPS cluster PARAM-Kanchenjunga. The analysis identifies the major factors that influence performance, that is, the sources of overhead related to computation, communication or IPC, memory, I/O contention, scheduling, load imbalance, synchronization, latency, and network jitter, and determines their impact. Because existing approaches are found insufficient to achieve the possible speedup, advanced methods with a variety of alternatives, such as RDMA-enabled libraries, PFS, MPI-integrated extensions, loop tiling, and hybrid parallelization, are offered for consideration in optimization. The paper is intended to assist in preparing performance-aware designs of experiments and performance models.
