Optimal Filter Partition for Efficient Convolution with Short Input/Output Delay

A new algorithm to find an optimal filter partition for efficient long convolution with low input/output delay is presented. For a specified input/output delay and filter length, our algorithm finds the non-uniform filter partition that minimizes the computational cost of the convolution. We perform a detailed cost analysis of different block convolution schemes, and show that our optimal-partition finder algorithm allows for significant performance improvement. Furthermore, when several long convolutions are computed in parallel and their outputs are mixed down (as is the case in multiple-source 3-D audio rendering), the algorithm finds an optimal partition (common to all channels) that allows for further performance optimization.

INTRODUCTION

The direct implementation of the convolution sum in the time domain has no inherent latency, but its computational cost, measured as the number of multiply-add operations per output sample, increases linearly with the length of the convolving filter [4]. This makes the algorithm impractical for performing long convolutions in real time. On the other hand, frequency-domain single-block convolution based on the overlap-add or overlap-save schemes [7] has a cost per output sample that increases only logarithmically with the length of the convolving filter. However, this high efficiency comes at the expense of an input/output delay equal to at least the impulse response length [4].

A common approach to achieve low latency while keeping computational cost down is to partition the convolving impulse response into shorter blocks [1]. The filter can be represented by an equivalent set of shorter filters in parallel, as shown in figure 1. Each parallel branch consists of one block of the impulse response and a delay equal to the time offset of that block into the impulse response. Single-block convolution is performed independently for each branch, and the branch outputs are overlap-added.
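The equivalence between the long filter and its parallel structure of delayed shorter filters can be checked numerically. The following is a minimal sketch, not the paper's implementation: each branch uses plain time-domain convolution (rather than FFT-based single-block convolution) purely to show that summing the delayed branch outputs reproduces the full convolution. The function names `direct_convolve` and `partitioned_convolve` and the example block sizes are illustrative assumptions.

```python
def direct_convolve(x, h):
    """Time-domain convolution sum of sequences x and h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def partitioned_convolve(x, h, block_sizes):
    """Convolve x with h split into consecutive blocks.

    Each block is one parallel branch; its output is delayed by the
    offset of that block into h, and all branch outputs are summed.
    """
    if sum(block_sizes) != len(h):
        raise ValueError("block sizes must add up to the filter length")
    y = [0.0] * (len(x) + len(h) - 1)
    offset = 0  # delay of the current branch, in samples
    for size in block_sizes:
        branch = direct_convolve(x, h[offset:offset + size])
        for n, value in enumerate(branch):
            y[n + offset] += value  # delayed overlap-add of the branch output
        offset += size
    return y


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    h = list(range(1, 11))  # hypothetical 10-tap impulse response
    # Four blocks of increasing size, chosen only for illustration.
    print(partitioned_convolve(x, h, [1, 1, 4, 4]) == direct_convolve(x, h))
```

Any choice of block sizes that adds up to the filter length gives the same output; the partition only affects cost and latency once each branch is computed with FFT-based block convolution.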
GARCIA OPTIMAL FILTER PARTITION FOR EFFICIENT CONVOLUTION WITH SHORT INPUT/OUTPUT DELAY AES 113 CONVENTION, LOS ANGELES, CA, USA, 2002 OCTOBER 5–8

Figure 1: Long filter partitioned into four blocks (top), and equivalent parallel structure of four shorter filters (bottom).

The simplest scheme of this kind consists of a partition into blocks of uniform length. In this case, only one FFT on the input is needed for each output sample block, since the single-block convolutions in all branches use the same block length. If the partition is made into blocks of different sizes, then one FFT on the input is needed for each block size.

Reference [4] presents an efficient multiple-block algorithm based on a particular non-uniform partition into blocks of increasing size, with shorter blocks heading the impulse response. This exploits the fact that short blocks provide low latency, whereas longer blocks make the convolution less expensive. This additional degree of freedom makes this scheme generally more efficient than uniform partition.

However, uniform partition offers room for performance optimization if the overlap-add is performed in the frequency domain. FFT blocks after spectral multiply can be overlap-added directly in the frequency domain [2][3], i.e. onto a “frequency-domain delay line” (FDL), and then only one inverse FFT needs to be performed for each input FFT. This algorithm has a convex cost function, and the optimal block size can be obtained by differentiation. However, for long filters the optimal block size is usually too long compared to acceptable input/output delay values, and the cost increases dramatically when the block length is shortened. One solution is to partition the filter into two FDLs, i.e. two sets of uniform-length blocks: a header FDL of short block size fixed by the latency requirement, followed by a second FDL of longer block size to keep cost down.
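To make the single-FDL idea concrete, the following is a minimal sketch of uniformly partitioned convolution in which the spectral products are accumulated onto a frequency-domain delay line, so that one forward FFT and one inverse FFT are computed per input block. It is an illustration under stated assumptions and not the paper's implementation: it uses overlap-save with an FFT size of twice the block length, a small recursive radix-2 FFT written here for self-containment (so the block length must be a power of two), and the hypothetical names `fft`, `ifft`, `direct_convolve` and `fdl_convolve`.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(a):
    """Inverse FFT with 1/n normalization."""
    return [v / len(a) for v in fft(a, inverse=True)]


def direct_convolve(x, h):
    """Time-domain reference convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def fdl_convolve(x, h, block):
    """Uniformly partitioned overlap-save convolution with one FDL."""
    n_fft = 2 * block
    num_blocks = -(-len(h) // block)  # ceiling division
    # Spectrum of each zero-padded filter block, computed once.
    spectra = []
    for p in range(num_blocks):
        seg = list(h[p * block:(p + 1) * block])
        spectra.append(fft(seg + [0.0] * (n_fft - len(seg))))
    out_len = len(x) + len(h) - 1
    num_in = -(-out_len // block)  # extra zero blocks flush the filter tail
    xp = list(x) + [0.0] * (num_in * block - len(x))
    # Slot p of the delay line holds the spectrum due p blocks from now.
    fdl = [[0j] * n_fft for _ in range(num_blocks)]
    prev = [0.0] * block
    y = []
    for m in range(num_in):
        cur = xp[m * block:(m + 1) * block]
        X = fft(prev + cur)  # one forward FFT per input block
        for p in range(num_blocks):
            slot, Hp = fdl[p], spectra[p]
            for k in range(n_fft):
                slot[k] += Hp[k] * X[k]  # accumulate in the frequency domain
        out = ifft(fdl[0])  # one inverse FFT per input block
        y.extend(v.real for v in out[block:])  # overlap-save: keep last half
        fdl = fdl[1:] + [[0j] * n_fft]  # advance the delay line by one block
        prev = cur
    return y[:out_len]


if __name__ == "__main__":
    x = [float((7 * i) % 5 - 2) for i in range(23)]
    h = [float((3 * i) % 7 - 3) for i in range(10)]
    err = max(abs(a - b) for a, b in zip(fdl_convolve(x, h, 4), direct_convolve(x, h)))
    print(err < 1e-9)
```

In this sketch each output block becomes available as soon as the corresponding input block has been received, so shortening the block lowers the delay but increases the number of spectral multiply-accumulate passes per block, which is the trade-off described above.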
In this paper we show how to optimize the cost of this algorithm by varying the block size and number of blocks of the second FDL, and the number of blocks of the header FDL. The cost of this “double-FDL convolution” algorithm remains at much lower levels for short input/output latencies.

The double-FDL approach suggests that further performance optimization could be achieved with a partition into multiple FDLs, i.e. into multiple segments where each segment consists of a set of uniform-length blocks, each segment having a block length larger than that of the previous segment. The parameters of this partition scheme are the total number of FDLs and the block size and number of blocks of each FDL. However, for impulse responses several seconds long and low specified input/output latencies, the number of possible multiple-FDL partitions is quite large and it is not obvious how to choose an efficient one. Performing an exhaustive search over all possible partitions can be computationally very expensive, especially when the partition needs to be updated periodically in order to track variable filter lengths, and there is little hope that an arbitrarily picked partition will be efficient.

In this paper, we present an efficient algorithm to automatically find the optimal multiple-FDL partition that minimizes the computational cost of the convolution, for a given filter length and a specified input/output delay. The algorithm uses dynamic programming and allows for optimization of several convolution channels in parallel, where further performance improvement can be achieved by downmixing the channels in the frequency domain, if the same partition is used for all channels. Cost comparisons show that multiple-FDL convolution, based on the optimal partition found by our algorithm, is more than twice as efficient as the non-uniform block convolution algorithm given in [4].
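To give a feel for how the number of candidate multiple-FDL partitions grows with filter length, the following sketch counts them under a deliberately simplified model. The model is an assumption made here for illustration only and is not taken from the paper: block sizes are powers of two times the header block size, the header FDL uses the smallest block size, block sizes strictly increase from one FDL to the next, each FDL holds at least one block, and the blocks cover the filter exactly. Any further constraints that a real-time schedule would impose are ignored. The name `count_partitions` is hypothetical.

```python
from functools import lru_cache


def count_partitions(total_units):
    """Count multiple-FDL partitions of a filter of `total_units` header blocks.

    Lengths are measured in units of the header block size, so an FDL with
    exponent e uses blocks of 2**e units.
    """

    @lru_cache(maxsize=None)
    def ways(remaining, min_exp):
        # Ways to cover `remaining` units with further FDLs whose block
        # exponents are all >= min_exp and strictly increasing.
        if remaining == 0:
            return 1
        total = 0
        exp = min_exp
        while (1 << exp) <= remaining:
            size = 1 << exp
            blocks = 1
            while blocks * size <= remaining:
                total += ways(remaining - blocks * size, exp + 1)
                blocks += 1
            exp += 1
        return total

    # The header FDL uses the smallest block size (exponent 0) and
    # contains at least one block.
    return sum(ways(total_units - n, 1) for n in range(1, total_units + 1))


if __name__ == "__main__":
    for units in (4, 16, 64, 256):
        print(units, count_partitions(units))
```

Even in this restricted model the count rises quickly with the ratio of filter length to header block size, and evaluating the cost of every candidate each time the filter length changes would add to the expense.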
In the following sections we present a cost analysis of the main frequency-domain block convolution schemes, and then describe how to find the optimal partition for the multiple-FDL scheme.

COST ANALYSIS OF FREQUENCY-DOMAIN BLOCK CONVOLUTION ALGORITHMS

In the following cost analysis, we assume that the Discrete Fourier Transform is computed using the FFT algorithm. The cost to compute an N-point real