Linux Kernel Debugging on Google-sized clusters

This paper discusses the difficulties and methods involved in debugging the Linux kernel on huge clusters. Intermittent errors that occur only once every few years on a single machine are hard to debug, and they become a real problem when the same kernel runs across thousands of machines simultaneously. The more we scale clusters, the more critical reliability becomes. Many of the normal debugging luxuries, such as a serial console or physical access, are unavailable. Instead, we need a new strategy for addressing thorny intermittent race conditions. This paper presents the case for a new set of tools that are critical to solving these problems and are also very useful in a broader context. It then presents the design of one such tool, created as a hybrid of a Google internal tool and the open source LTTng project. Real-world case studies are included.

Mathieu Desnoyers | Martin J. Bligh

[1] R.W. Wisniewski, et al. Efficient, Unified, and Scalable Performance Monitoring for Multiprocessor Operating Systems, 2003, ACM/IEEE SC 2003 Conference (SC'03).

[2] Michel Dagenais, et al. System Administration: The Linux Trace Toolkit, 2000.

[3] Bryan Cantrill, et al. Dynamic Instrumentation of Production Systems, 2004, USENIX Annual Technical Conference, General Track.

[4] M. Desnoyers. Low Disturbance Embedded System Tracing with Linux Trace Toolkit Next Generation, 2006.

[5] M. Desnoyers, et al. The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux, 2006.

[6] Robert Wisniewski. relayfs: An Efficient Unified Approach for Transmitting Data from Kernel to User Space, 2003.