Performance debugging using parallel performance predicates

Parallel programs incur overhead in many different ways, such as synchronization, load imbalance, communication, and insufficient parallelism. We have found that all of these categories are important in understanding the performance of parallel programs, and that a rapid assessment of how processing time is spent in each of these categories is extremely helpful in the performance tuning of parallel programs. As a result we have developed the notion of performance predicates, which are expressions that define these categories and can be used to recognize and classify inefficient states during a program's execution. Formal definition allows us to discuss the categories quantitatively; we present a method for measuring time spent in each category, based on the common metric of lost cycles. The method we describe, called predicate profiling, is shown to be quite useful for both application-level and program-level performance tuning. We show that predicate profiling is relatively easy to implement, and has very low run-time cost. We also show that the lost cycles metric is applicable to programs for which other metrics, like speedup, aren't well defined.
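
To illustrate the idea, the following is a minimal sketch of how predicates might classify sampled processor states and accumulate lost cycles per category. It is not the implementation described in this paper: the sample fields (`waiting_on_lock`, `waiting_on_message`, `idle`, `work_available`), the particular predicate definitions, and the functions `classify` and `lost_cycles` are all hypothetical, chosen only to make the example concrete.

```python
from collections import Counter


def classify(sample):
    """Return the overhead category for one processor sample.

    Each predicate below is an assumed, simplified definition, not the
    paper's. Returns None when the processor is doing useful work.
    """
    if sample["waiting_on_lock"]:
        return "synchronization"
    if sample["waiting_on_message"]:
        return "communication"
    if sample["idle"] and sample["work_available"]:
        # Idle while work exists elsewhere in the system.
        return "load imbalance"
    if sample["idle"]:
        # Idle and no work exists anywhere.
        return "insufficient parallelism"
    return None


def lost_cycles(samples, cycles_per_sample=1):
    """Sum the cycles attributed to each overhead category."""
    totals = Counter()
    for sample in samples:
        category = classify(sample)
        if category is not None:
            totals[category] += cycles_per_sample
    return totals
```

Because every category is measured in the same unit (cycles), the totals can be compared directly to see which kind of overhead dominates a run.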