Improving Performance of Triangular Matrix-Vector BLAS Routines on GPUs