An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation