Graphics processing units (GPUs) are attractive for non-graphics parallel computation because they offer the potential for more than an order of magnitude of speedup over CPUs. However, because the GPU is typically programmed through a C-like abstraction such as Nvidia's CUDA, little is known about its hardware architecture beyond the high-level descriptions published by the manufacturer. We develop a suite of micro-benchmarks to measure the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. We measure properties of the arithmetic pipelines, the stack-based handling of branch divergence, and the warp-granularity operation of the barrier synchronization instruction. We confirm that global memory is uncached, with a latency of roughly 441 clock cycles, and we measure parameters of the three levels of instruction and constant caches and the three levels of TLBs. We reveal more detail about the GT200 architecture than has previously been disclosed.
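The latency figures above come from simple micro-benchmark kernels timed with the GPU's on-chip cycle counter. As a rough illustration of the general technique, the sketch below uses a single-thread pointer-chasing kernel, in which each load depends on the result of the previous one, so per-load latency is exposed rather than hidden by overlap. The array size, stride, and iteration count are illustrative assumptions, not the parameters used in the study, and the measured value includes a small amount of loop overhead.

```cuda
// Minimal sketch of a pointer-chasing latency micro-benchmark (illustrative
// parameters only; not the exact benchmark used in the paper).
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 256   // number of dependent loads to time (assumption)

__global__ void latency_kernel(unsigned int *chain, unsigned int *time,
                               unsigned int *sink)
{
    unsigned int j = 0;
    unsigned int start = clock();          // on-chip cycle counter
    for (int i = 0; i < ITERS; i++) {
        j = chain[j];                      // each load depends on the previous one
    }
    unsigned int end = clock();
    *sink = j;                             // keep the loads from being optimized away
    *time = (end - start) / ITERS;         // average cycles per dependent load
}

int main()
{
    const int N = 1024;                    // elements in the pointer chain (assumption)
    const int STRIDE = 64;                 // stride between successive elements (assumption)
    unsigned int h_chain[N];
    for (int i = 0; i < N; i++)
        h_chain[i] = (i + STRIDE) % N;     // circular chain of indices

    unsigned int *d_chain, *d_time, *d_sink;
    cudaMalloc(&d_chain, N * sizeof(unsigned int));
    cudaMalloc(&d_time, sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMemcpy(d_chain, h_chain, N * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // Launch a single thread so no other memory traffic competes for bandwidth.
    latency_kernel<<<1, 1>>>(d_chain, d_time, d_sink);

    unsigned int cycles;
    cudaMemcpy(&cycles, d_time, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("average global memory load latency: %u cycles\n", cycles);

    cudaFree(d_chain); cudaFree(d_time); cudaFree(d_sink);
    return 0;
}
```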