High performance computing for vision on distributed-memory machines

Computer vision has been identified as a Grand Challenge application by the High Performance Computing and Communications initiative. With advances in microprocessor and network technology, current massively parallel machines can achieve hundreds of gigaflops. These machines have a distributed-memory architecture, which allows them to scale to large system sizes. Examples include the TMC CM-5, IBM SP-2, Intel Paragon, Meiko CS-2, and Cray T3D. These high-performance computing platforms seem to have opened new avenues for meeting the computational challenge of vision. Even though many such "gigaflops" machines have become available, straightforward approaches to parallelizing vision applications on them do not yield satisfactory performance. In a distributed-memory architecture, communication operations incur considerable overhead. Because communication in intermediate- and high-level vision algorithms is irregular, this overhead can increase with the size of the parallel system, leading to poor performance. As a consequence, the algorithms do not scale to large system sizes. It is therefore necessary to develop efficient algorithmic techniques for the various vision processes in order to achieve larger speed-ups.
The focus of our work is to develop scalable and portable parallel algorithms for computer vision tasks on distributed-memory machines. We propose a computational model for distributed-memory machines that accounts for the cost of data communication through two parameters: the communication startup cost and the data transmission rate. To illustrate our algorithms and implementations, we parallelize vision tasks in a building detection system and in an object recognition system. Based on the model, we show scalable algorithms for several key steps in the building detection system, including a linear feature extraction task and a perceptual grouping task, as well as for a high-level task in the object recognition system.
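To make the role of the two model parameters concrete, the following is a minimal sketch in C of how a startup cost and a transmission rate can be combined into a message cost estimate. It assumes the common linear form (startup time plus message size times per-byte time); the abstract does not state the exact formula, so this form, the struct `comm_model`, and the functions `message_time` and `split_time` are illustrative assumptions rather than the model as defined in this work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters of a two-parameter communication cost model.
 * The linear form used below is an assumption, not taken from the source. */
typedef struct {
    double startup_s;   /* fixed startup cost per message, in seconds */
    double per_byte_s;  /* transmission time per byte, in seconds (1 / rate) */
} comm_model;

/* Estimated time to send one message of `bytes` bytes:
 * one startup cost plus the transmission time of the payload. */
double message_time(const comm_model *m, size_t bytes) {
    assert(m != NULL);
    return m->startup_s + m->per_byte_s * (double)bytes;
}

/* Estimated time to send `total_bytes` as `count` separate messages:
 * the startup cost is paid once per message, the payload cost once overall. */
double split_time(const comm_model *m, size_t total_bytes, size_t count) {
    assert(m != NULL && count > 0);
    return (double)count * m->startup_s + m->per_byte_s * (double)total_bytes;
}
```

Under this assumed form, sending the same volume of data as many small messages pays the startup cost repeatedly, which is one way irregular communication patterns can become expensive as the number of processors grows.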
For portability, our codes are written in C using the Message Passing Interface (MPI) standard, so they can run on several high-performance platforms. So far they have been ported to the CM-5, SP-2, and T3D. These implementations achieve fast execution of the vision tasks. For example, given a 2048 x 2048 image, linear feature extraction on a 512-node CM-5 completes in 1.118 seconds. The same task takes more than 8 minutes on a state-of-the-art Sun SPARCstation.
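As a quick check on the two timings quoted above, the short C sketch below computes their ratio. The helper `time_ratio` is a hypothetical name introduced for illustration. Because the workstation time is given only as "more than 8 minutes", the result is a lower bound, and because the two runs are on different machines it is a ratio of execution times rather than a speed-up in the strict single-node sense.

```c
#include <assert.h>

/* Ratio of a baseline execution time to a parallel execution time.
 * A value greater than 1 means the parallel run finished sooner. */
double time_ratio(double baseline_s, double parallel_s) {
    assert(baseline_s > 0.0 && parallel_s > 0.0);
    return baseline_s / parallel_s;
}
```

Calling `time_ratio(8.0 * 60.0, 1.118)` gives a value of roughly 429, so the 512-node CM-5 run is at least about 429 times faster than the workstation run for this task.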