Batched Triangular DLA for Very Small Matrices on GPUs