Using asynchronous and bulk communications to construct an optimizing compiler for distributed-memory machines with consideration given to communications costs