Parallelization of programs containing loop-carried dependences with resource constraints

This dissertation proposes scheduling methods for parallelizing programs containing two common types of loops with loop-carried dependences (LCD's): parallel prefix computation (PPC) and band linear recurrences (BLR's). These loops are major bottlenecks in the parallel execution of many important programs (e.g., some Grand Challenge problems). The dissertation is based on the belief that good computer science research should bring innovative conceptual advances with practical impact, and that these new ideas must be supported by useful theory and validated by implementation. Accordingly, I propose a new scheduling method called Harmonic Scheduling (HS), a technique for design space exploration of parallel schedules for evaluating BLR's. Using HS, I have found time-optimal parallel schedules for PPC (also known as scan) under resource constraints. I have also derived new classes of parallel schedules for computing BLR's under resource constraints and shown that they achieve the optimal time for first- and second-order BLR's. Using HS, I have derived the Regular Schedules, which are scalable and have regular computation structures. Using Regular Schedules, I have developed a method for parallel programming of loops containing PPC and BLR's intermixed with other code. Using this Regular-Schedule-based programming method, I obtained significant performance improvements for a range of benchmark programs on the Convex C240 vector supercomputer over the same programs coded using highly optimized BLAS (Basic Linear Algebra Subprograms) routines, which are the best available hand-coded assembly routines for vector parallel computers. I have also applied HS to the design of special-purpose parallel architectures for computing infinite impulse response (IIR) filters in digital signal processing (DSP). HS can be used as a tool for design space exploration in both performance-driven and resource-driven system-level design of these architectures in CAD environments.
I found rate-optimal schedules for IIR filters on a class of VLSI architectures with differing types of functional units. Based on this result, I generated the first scalable design for IIR filters using multi-chip module (MCM) technology; previous designs are not scalable because of the loop-carried dependence bottlenecks. Using HS, I also explored the design space of scalable architectures that implement IIR filters using MCM's and scalable interconnect, and demonstrated that the design space for IIR filter architectures obtained by applying HS is bounded only by hardware limits, whereas the design space under previous scheduling methods is severely limited by the LCD's in the programs.
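
For readers unfamiliar with the two loop types, the following minimal sketch shows a first-order linear recurrence, x[i] = a[i] * x[i-1] + b[i] (a first-order IIR filter is the special case with constant a), and how its loop-carried dependence can be recast as a prefix computation over affine maps. This is the textbook recursive-doubling scan, not Harmonic Scheduling or the Regular Schedules, and it ignores resource constraints entirely; all function names are hypothetical and chosen only for illustration.

```python
def recurrence_sequential(a, b, x0):
    """Evaluate x[i] = a[i] * x[i-1] + b[i] with the loop-carried dependence intact."""
    out = []
    x = x0
    for ai, bi in zip(a, b):
        x = ai * x + bi  # each iteration needs the previous iteration's x
        out.append(x)
    return out


def compose(f, g):
    """Compose two affine maps given as (slope, offset): apply f first, then g."""
    fa, fb = f
    ga, gb = g
    return (ga * fa, ga * fb + gb)


def scan_recursive_doubling(items, op):
    """Inclusive prefix computation in about log2(n) rounds.

    Within a round every element is computed independently of the others,
    so a round could be executed in parallel; here it is a list comprehension.
    """
    result = list(items)
    step = 1
    while step < len(result):
        prev = result
        result = [
            prev[i] if i < step else op(prev[i - step], prev[i])
            for i in range(len(prev))
        ]
        step *= 2
    return result


def recurrence_via_scan(a, b, x0):
    """Evaluate the same recurrence by scanning the affine maps, then applying them to x0."""
    prefixes = scan_recursive_doubling(list(zip(a, b)), compose)
    return [slope * x0 + offset for slope, offset in prefixes]


if __name__ == "__main__":
    a = [2, 3, 1, -1]
    b = [1, 0, 4, 2]
    print(recurrence_sequential(a, b, 1))
    print(recurrence_via_scan(a, b, 1))
```

Because composition of affine maps is associative, any valid prefix schedule produces the same result as the sequential loop; the choice of schedule only changes the number of steps and the number of operations performed per step.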