On The [Ir]relevance of Network Performance for Data Processing

Modern data processing frameworks are used in a variety of settings for a diverse set of workloads such as sorting, indexing, iterative computations, structured query processing, etc. As these frameworks run in a distributed environment, a natural question to ask is – how important is the network to the performance of these frameworks? Recent research in this field has led to contradictory results. One camp advocates the limited impact of networking performance on the overall performance of the framework [16]. On the other hand, there is a large body of work on networking optimizations for data processing frameworks [9, 10, 18, 19, 21]. In this paper, we search for a better understanding of the matter. While answering the basic question concerning the importance of the network performance, our analysis raises new questions and points to previously unexplored or unnoticed avenues for performance optimizations. We take Apache Spark [2] as a representative of a modern data-processing framework. However, to broaden the scope of our investigation, we also experiment with other frameworks such as Flink, PowerGraph or Timely. In our study – rather than analysing Spark-specific peculiarities – we look into procedures and subsystems that are common in any of these frameworks such as networking IO, shuffle data management, object (de)serialization, copies, job scheduling and coordination, etc. Nonetheless, we are aware that the roles of those individual components are different for the various systems, and we exercise caution when making generalized statements about the performance. Our study reveals three main findings: (a) up to a certain level, the performance of the network has significant effects on the overall performance of the data processing framework. Specifically, for all the workloads and frameworks we analyzed, moving from a 1 to a 10 Gbps network reduced the query response time by a factor of two and more (see Figure 1). This result directly 0 50 100 150 200 250