Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors