Batched LU Factorization With Fast Row Interchanges for Small Matrices on GPUs