Author retrospective for optimizing matrix multiply using PHiPAC: a portable high-performance ANSI C coding methodology

PHiPAC was an early attempt to improve software performance by searching in a large design space of possible implementations to find the best one. At the time, in the early 1990s, the most efficient numerical linear algebra libraries were carefully hand-tuned for specific microarchitectures and compilers, and were often written in assembly language. This allowed very precise tuning of an algorithm to the specifics of a current platform, and provided great opportunity for high efficiency. The prevailing thought at the time was that such an approach was necessary to produce near-peak performance. On the other hand, this approach was brittle and required great human effort to try each code variant, so only a tiny subset of the possible code design points could be explored. Worse, given the combined complexities of the compiler and microarchitecture, it was difficult to predict which code variants would be worth the implementation effort.

PHiPAC circumvented this effort by using code generators that could easily generate a vast assortment of very different points within a design space, and even across very different design spaces altogether. By following a set of carefully crafted coding guidelines, the generated code was reasonably efficient for any point in the design space.

To search the design space, PHiPAC took a rather naive but effective approach. Due to the human-designed and deterministic nature of computing systems, one might reasonably think that smart modeling of the microprocessor and compiler would be sufficient to predict, without performing any timing, the optimal point for a given algorithm. But the combination of an optimizing compiler and a dynamically scheduled microarchitecture with multiple levels of cache and prefetching results in a very complex, highly non-linear system. Small changes in the code can cause large changes in performance, thwarting "smart" search algorithms that attempt to build models of the performance surface.
PHiPAC instead used massive, diverse, offline, and only loosely informed grid or randomized search, evaluating each point by empirically timing the code on the target machine. The approach succeeded because it was indifferent to the shape of the performance surface.
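The empirical search described above can be illustrated with a minimal sketch in C99. This is not PHiPAC's generator or search driver: PHiPAC emitted a separate source variant for each design point, whereas this sketch substitutes a single tiled loop nest whose tile size is a run-time parameter. The names `matmul_tiled`, `time_variant`, and `search_best_tile`, the candidate tile sizes, and the use of `clock()` as the timer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for one generated variant: C += A * B on n x n
   row-major matrices, using square tiles of edge length `tile`. */
void matmul_tiled(size_t n, size_t tile,
                  const double *a, const double *b, double *c)
{
    assert(tile > 0); /* a zero tile size would never advance the loops */
    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t imax = ii + tile < n ? ii + tile : n;
                size_t kmax = kk + tile < n ? kk + tile : n;
                size_t jmax = jj + tile < n ? jj + tile : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}

/* Measure one design point empirically: CPU seconds for `reps` runs. */
double time_variant(size_t n, size_t tile, int reps,
                    const double *a, const double *b, double *c)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        matmul_tiled(n, tile, a, b, c);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Grid search: time every candidate and keep the fastest. No model of the
   machine is consulted; only measurements decide. Returns 0 on allocation
   failure or if no usable candidate was given. */
size_t search_best_tile(size_t n, const size_t *tiles, size_t ntiles, int reps)
{
    double *a = malloc(n * n * sizeof *a);
    double *b = malloc(n * n * sizeof *b);
    double *c = malloc(n * n * sizeof *c);
    size_t best = 0;
    double best_secs = -1.0;

    if (a && b && c) {
        for (size_t i = 0; i < n * n; i++) {
            a[i] = (double)(i % 7);
            b[i] = (double)(i % 5);
        }
        for (size_t t = 0; t < ntiles; t++) {
            if (tiles[t] == 0)
                continue; /* skip unusable candidates */
            for (size_t i = 0; i < n * n; i++)
                c[i] = 0.0;
            double secs = time_variant(n, tiles[t], reps, a, b, c);
            if (best_secs < 0.0 || secs < best_secs) {
                best_secs = secs;
                best = tiles[t];
            }
        }
    }
    free(a);
    free(b);
    free(c);
    return best;
}

#ifdef TILE_SEARCH_DEMO /* build with -DTILE_SEARCH_DEMO to run the demo */
int main(void)
{
    const size_t tiles[] = {8, 16, 32, 64, 128};
    size_t best = search_best_tile(512, tiles, sizeof tiles / sizeof tiles[0], 3);
    printf("fastest tile size on this machine: %zu\n", best);
    return 0;
}
#endif
```

The winning tile size depends on the compiler, its flags, and the machine, which is the point of measuring rather than predicting. A randomized search would replace the loop over `tiles` with random draws from the same candidate set.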
