Reverse Time Migration with Heterogeneous Multicore and Manycore Clusters

In this work we propose a parallel implementation of reverse time migration (RTM) based on cooperative work between CPUs and coprocessors, which proved competitive with other available accelerated solutions. The implementation runs with any number of coprocessors, from zero up to the maximum allowed by the computer vendor's specifications, and scales well in a cluster environment. Because it is based on a standard programming model, it will also be portable without modification to future configurations of Xeon and Xeon Phi, or to any X-CPU/Y-CPU combination that supports MPI, OpenMP, and the C language. Here we describe our unified programming model for optimized code. We also discuss load balancing for the heterogeneous cluster configuration and validate the performance and scalability of the current implementation. In the current configuration, with four Xeon Phi cards of 16 GB GDDR5 each (64 GB total), we can migrate full shot gathers on a single node. The proposed node configuration also frees memory on the 2-socket host for RTM formulations that might require saving snapshots for cross-correlation, or any other auxiliary arrays, between iterations of the algorithm.
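
To make the load-balancing idea concrete, the following is a minimal sketch in C of one possible static scheme, not necessarily the one used in this work: the planes of the computational grid along one axis are divided among the devices in proportion to a relative throughput weight. The function name partition_planes, the weights, and the choice of a one-dimensional decomposition are all assumptions made for illustration. With a single device (no coprocessors) the whole grid falls to the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split n_planes grid planes among n_dev devices
 * (CPU sockets and coprocessors) in proportion to weight[d], an assumed
 * relative throughput per device. On return, device d owns the planes
 * [first[d], first[d] + count[d]). */
void partition_planes(size_t n_planes, size_t n_dev, const double *weight,
                      size_t *first, size_t *count)
{
    assert(n_dev > 0);

    double total = 0.0;
    for (size_t d = 0; d < n_dev; ++d) {
        assert(weight[d] >= 0.0);
        total += weight[d];
    }
    assert(total > 0.0);

    double cum = 0.0;   /* cumulative weight up to and including device d */
    size_t begin = 0;   /* first plane not yet assigned */
    for (size_t d = 0; d < n_dev; ++d) {
        cum += weight[d];
        /* Round the cumulative boundary to the nearest plane; the last
         * device always ends at n_planes so that no plane is lost. */
        size_t end = (d + 1 == n_dev)
                         ? n_planes
                         : (size_t)((double)n_planes * (cum / total) + 0.5);
        if (end < begin) end = begin;
        if (end > n_planes) end = n_planes;
        first[d] = begin;
        count[d] = end - begin;
        begin = end;
    }
}
```

As a usage example under assumed weights, a node with two CPU sockets of weight 1.0 and four coprocessors of weight 2.0 splitting 1000 planes would give each socket 100 planes and each coprocessor 200. In practice the weights would have to be measured, for instance from the time per iteration on each device.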