Fast 4-way parallel radix sorting on GPUs