Identifying the Culprits Behind Network Congestion

Network congestion is one of the primary causes of performance degradation, performance variability, and poor scaling in communication-heavy parallel applications. However, the causes and mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model, and predict this critical behavior in order to improve the performance of large-scale parallel applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient-boosted regression trees, to perform regression analysis on communication data and application execution time. Using data derived from multiple executions, we create models to predict the execution time of communication-heavy parallel applications. This analysis also identifies the features, and the associated hardware components, that have the most impact on network congestion and, in turn, on execution time. The ideas presented in this paper have wide applicability: predicting execution time on a different number of nodes, on different input datasets, or even for an unknown code; identifying the best configuration parameters for an application; and finding the root causes of network congestion on different architectures.
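The regression workflow the abstract describes can be sketched with scikit-learn, which provides both of the named ensemble methods. This is a minimal illustration, not the paper's pipeline: the feature matrix and the "max link load" feature name are synthetic placeholders standing in for the real per-run communication counters.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_runs = 200

# Synthetic per-run communication features (e.g., bytes per link, stalls per
# network port); column names and values are illustrative only.
X = rng.random((n_runs, 4))
# Execution time driven mostly by column 2 (a stand-in for "max link load").
y = 10.0 + 25.0 * X[:, 2] + rng.normal(0.0, 0.5, n_runs)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (ExtraTreesRegressor(n_estimators=100, random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_train, y_train)
    # score() reports R^2 on held-out runs; feature_importances_ ranks which
    # communication feature most influences the predicted execution time.
    print(type(model).__name__,
          "R^2 =", round(model.score(X_test, y_test), 3),
          "| most important feature:", int(np.argmax(model.feature_importances_)))
```

Because both estimators expose `feature_importances_`, the same fitted models that predict execution time also point at the features (and hence hardware components) most responsible for congestion, which is the dual use the abstract highlights.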
