Modern data processing frameworks are used in a variety of settings for a diverse set of workloads, such as sorting, indexing, iterative computation, and structured query processing. As these frameworks run in a distributed environment, a natural question to ask is: how important is the network to their performance? Recent research has produced contradictory results. One camp argues that network performance has only a limited impact on the overall performance of the framework [16]. On the other hand, there is a large body of work on networking optimizations for data processing frameworks [9, 10, 18, 19, 21]. In this paper, we seek a better understanding of the matter. While answering the basic question concerning the importance of network performance, our analysis raises new questions and points to previously unexplored or unnoticed avenues for performance optimization.

We take Apache Spark [2] as a representative of a modern data processing framework. To broaden the scope of our investigation, however, we also experiment with other frameworks such as Flink, PowerGraph, and Timely. Rather than analysing Spark-specific peculiarities, our study looks into procedures and subsystems that are common to all of these frameworks, such as networking IO, shuffle data management, object (de)serialization, memory copies, and job scheduling and coordination. Nonetheless, we are aware that the roles of these individual components differ across systems, and we exercise caution when making generalized statements about performance.

Our study reveals three main findings: (a) up to a certain level, the performance of the network has significant effects on the overall performance of the data processing framework. Specifically, for all the workloads and frameworks we analyzed, moving from a 1 Gbps to a 10 Gbps network reduced the query response time by a factor of two or more (see Figure 1). This result directly
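The sub-linear benefit is easy to see with a back-of-the-envelope model (our own illustrative sketch with made-up numbers, not measurements from our experiments): a 10x bandwidth increase accelerates only the network-bound fraction of the response time, while compute, (de)serialization, and scheduling costs are unaffected.

```python
# Illustrative sketch (hypothetical numbers, not from the paper): a simple
# Amdahl-style model splitting job response time into a network-bound shuffle
# transfer and bandwidth-independent work (compute, serialization, scheduling).

def response_time(shuffle_bytes, link_gbps, other_s):
    """Response time = shuffle transfer time + bandwidth-independent time."""
    transfer_s = shuffle_bytes * 8 / (link_gbps * 1e9)  # bytes -> bits -> secs
    return transfer_s + other_s

GB = 1e9
SHUFFLE = 100 * GB   # hypothetical shuffle volume
OTHER = 600.0        # hypothetical bandwidth-independent seconds

t_1g = response_time(SHUFFLE, 1, OTHER)    # 800 s transfer + 600 s = 1400 s
t_10g = response_time(SHUFFLE, 10, OTHER)  # 80 s transfer + 600 s = 680 s
print(f"speedup: {t_1g / t_10g:.2f}x")     # ~2.06x despite 10x more bandwidth
```

Under these assumed proportions, ten times the bandwidth buys roughly a 2x speedup, consistent with the measurements above; the remaining time is spent in the bandwidth-independent subsystems that the rest of this paper examines.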
[1] Michael I. Jordan, et al. Managing data transfers in computer clusters with Orchestra. SIGCOMM, 2011.
[2] Scott Shenker, et al. The Case for Tiny Tasks in Compute Clusters. HotOS, 2013.
[3] Kristyn J. Maschhoff, et al. Experiences Running and Optimizing the Berkeley Data Analytics Stack on Cray Platforms. 2015.
[4] Amin Vahdat, et al. Themis: an I/O-efficient MapReduce. SoCC, 2012.
[5] Hosung Park, et al. What is Twitter, a social network or a news media? WWW, 2010.
[6] Michael I. Jordan, et al. Managing data transfers in computer clusters with Orchestra. SIGCOMM, 2011.
[7] Patrick Wendell, et al. Sparrow: distributed, low latency scheduling. SOSP, 2013.
[8] Michael Isard, et al. Scalability! But at what COST? HotOS, 2015.
[9] Ion Stoica, et al. Efficient coflow scheduling with Varys. SIGCOMM, 2014.
[10] Ion Stoica, et al. Efficient coflow scheduling with Varys. SIGCOMM, 2015.
[11] Scott Shenker, et al. Making Sense of Performance in Data Analytics Frameworks. NSDI, 2015.
[12] Amin Vahdat, et al. TritonSort: A Balanced Large-Scale Sorting System. NSDI, 2011.
[13] Ramana Rao Kompella, et al. The TCP Outcast Problem: Exposing Unfairness in Data Center Networks. NSDI, 2012.
[14] Wei Lin, et al. Microsoft Bing Peking University. 2022.
[15] Joseph Gonzalez, et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI, 2012.