Cilk provides the "best overall productivity" for high performance computing (and won the HPC challenge award to prove it)

My entry won the award for "Best Overall Productivity" in the 2006 HPC Challenge Class 2 (productivity) competition. I used the Cilk multithreaded programming language [1] to implement all six of the benchmarks, including LU decomposition with partial pivoting, matrix multiplication, vector add, matrix transpose, updates of random locations in a large table, and a huge 1-dimensional FFT. I measured the performance on NASA's "Columbia" SGI Altix system. The programs achieved good performance (e.g., up to 943 GFLOPS on 256 processors for matrix multiplication). I added a total of only 137 keywords to transform the six C programs into Cilk programs.
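To illustrate what "adding keywords" to a C program can look like, here is a minimal sketch of a divide-and-conquer vector add. It is not taken from the benchmark entry: the function name `vadd` and the cutoff `VADD_GRAIN` are hypothetical, and the code is written as plain C so that it compiles with an ordinary C compiler. The comments mark where the Cilk-5 keywords `cilk`, `spawn`, and `sync` would plausibly go.

```c
#include <stddef.h>

/* Hypothetical cutoff: at or below this size, run a plain serial loop. */
#define VADD_GRAIN 1024

/* Divide-and-conquer vector add: a[i] = b[i] + c[i] for 0 <= i < n.
 * Assumption: in Cilk-5 this would be declared `cilk void vadd(...)`. */
void vadd(double *a, const double *b, const double *c, size_t n)
{
    if (n <= VADD_GRAIN) {
        /* Base case: ordinary serial C loop, unchanged in the Cilk version. */
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        return;
    }
    size_t half = n / 2;
    /* The two halves write disjoint ranges, so they could run in parallel. */
    /* spawn */ vadd(a, b, c, half);
    /* spawn */ vadd(a + half, b + half, c + half, n - half);
    /* sync; would wait here for both spawned halves to finish. */
}
```

In this sketch the parallel version would differ from the serial C code by four keywords (one `cilk`, two `spawn`, one `sync`), which is the kind of count the abstract reports across all six benchmarks.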

[1] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting. SIAM J. Matrix Anal. Appl., 1997.

[2] Steven G. Johnson, et al. FFTW: An Adaptive Software Architecture for the FFT. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998.