Towards a large number of pipeline processors in a tightly coupled multiprocessor using no cache

Many scientific applications demand high performance, and multiprocessor structures cannot be avoided. Mapping many applications onto distributed supercomputers (e.g. hypercube structures) seems very difficult. On the other hand, performance on most large shared memory systems (CEDAR, RP3, ...) suffers from the very high latency of requests to the shared memory; caches or local memories are often used to increase performance. Performance then depends on good management of the memory hierarchy (and of the synchronization mechanisms) by the programmer. In previous papers, we pointed out that letting READs pass WRITEs on a memory with hardware detection of Read After Write (RAW) hazards makes it possible to reach reasonable performance on a pipeline processor across a very wide spectrum of numerical algorithms, even when the memory has a high latency. It also makes it possible to synchronize efficiently pipeline processors working directly on a shared memory in a relatively small tightly coupled multiprocessor (fewer than twenty pipeline processors). In this paper, we propose a possible memory access structure for a tightly coupled multiprocessor with a large number of pipeline processors (64 or 256) working directly on a shared memory.
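
The idea of letting READs pass WRITEs under RAW hazard detection can be illustrated with a small software model. This is only a sketch under our own assumptions, not the hardware mechanism of the paper: the class `BypassMemory`, its write queue, and the policy of draining the queue up to the conflicting WRITE are all hypothetical choices made for illustration.

```python
from collections import deque


class BypassMemory:
    """Toy model (hypothetical): READs overtake queued WRITEs unless a RAW hazard exists."""

    def __init__(self):
        self.cells = {}                 # committed memory contents
        self.pending_writes = deque()   # (address, value) pairs in program order

    def write(self, address, value):
        # A WRITE is only queued; it does not block the processor.
        self.pending_writes.append((address, value))

    def read(self, address):
        # RAW hazard detection: does any queued WRITE target this address?
        drained = 0
        while any(a == address for a, _ in self.pending_writes):
            # Hazard: commit queued WRITEs in order until the conflict is gone.
            a, v = self.pending_writes.popleft()
            self.cells[a] = v
            drained += 1
        # No hazard left: the READ is served ahead of the remaining WRITEs.
        # Uninitialized cells read as 0 in this model (an assumption).
        return self.cells.get(address, 0), drained


mem = BypassMemory()
mem.write(1, 10)
mem.write(2, 20)
no_hazard = mem.read(3)   # passes both queued WRITEs
hazard = mem.read(1)      # must wait for the WRITE to address 1
```

In this model a READ with no conflict never waits for the queue, while a READ that hits a queued address observes the value of the most recent earlier WRITE, which is the ordering guarantee that RAW detection is meant to preserve.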