The computing models for HEP experiments are globally distributed and grid-based. Obstacles to good network performance arise from many causes and can be a major impediment to the success of these computing models. Factors that affect overall network/application performance exist on the hosts themselves (application software, operating system, hardware), in the local area networks that support the end systems, and within the wide area networks. Because the computer and network systems are globally distributed, it can be very difficult to locate and identify the factors that are hurting application performance. In this paper, we present an end-to-end network/application performance troubleshooting methodology developed and in use at Fermilab. The core of our approach is to narrow down the problem scope with a divide-and-conquer strategy. The overall complex problem is split into two distinct sub-problems: host diagnosis and tuning, and network path analysis. After satisfactorily evaluating, and if necessary resolving, each sub-problem, we conduct end-to-end performance analysis and diagnosis. The paper also discusses the tools we use as part of the methodology. The long-term objective of the effort is to enable site administrators and end users to conduct much of the troubleshooting themselves, before (or instead of) calling upon network and operating system 'wizards,' who are always in short supply.