Summary: Many tricks and techniques have been developed to speed up the computation of FFTs. Significant reductions in computation time result from table lookup of twiddle factors, compiler-friendly or assembly-language programming, special hardware, and FFT algorithms for real-valued data. Higher-radix algorithms, fast bit-reversal, and special butterflies yield more modest but worthwhile savings.
The use of FFT algorithms such as the radix-2 decimation-in-time or decimation-in-frequency methods results in tremendous savings in computations when computing the discrete Fourier transform. While most of the speed-up of FFTs comes from this, careful implementation can provide additional savings ranging from a few percent to several-fold increases in program speed.
The twiddle factor, or
On most computers, only a fraction of the total computation time of an FFT is spent performing the FFT butterfly computations; determining indices, loading and storing data, computing loop parameters, and other operations consume the majority of cycles. Careful programming that allows the compiler to generate efficient code can make a several-fold improvement in the run-time of an FFT. The best choice of radix in terms of program speed may depend more on characteristics of the hardware (such as the number of CPU registers) or compiler than on the exact number of computations. Very often the manufacturer's library codes are carefully crafted by experts who know intimately both the hardware and compiler architecture and how to get the most performance out of them, so use of well-written FFT libraries is generally recommended. Certain freely available programs and libraries are also very good. Perhaps the best current general-purpose library is the FFTW package; information can be found at http://www.fftw.org. A paper by Frigo and Johnson describes many of the key issues in developing compiler-friendly code.
While compilers continue to improve, FFT programs written directly in the assembly language of a specific machine are often several times faster than the best compiled code. This is particularly true for DSP microprocessors, which have special instructions for accelerating FFTs that compilers don't use. (I have myself seen differences of up to 26 to 1 in favor of assembly!) Very often, FFTs in the manufacturer's or high-performance third-party libraries are hand-coded in assembly. For DSP microprocessors, the codes developed by Meyer, Schuessler, and Schwarz are perhaps the best ever developed; while the particular processors are now obsolete, the techniques remain equally relevant today. Most DSP processors provide special instructions and a hardware design favoring the radix-2 decimation-in-time algorithm, which is thus generally fastest on these machines.
Some processors have special hardware accelerators or co-processors specifically designed to accelerate FFT computations. For example, AMI Semiconductor's Toccata ultra-low-power DSP microprocessor family, which is widely used in digital hearing aids, has on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer.
In a surprising number of applications, almost all of the computations are FFTs. A number of special-purpose chips are designed to specifically compute FFTs, and are used in specialized high-performance applications such as radar systems. Other systems, such as OFDM-based communications receivers, have special FFT hardware built into the digital receiver circuit. Such hardware can run many times faster, with much less power consumption, than FFT programs on general-purpose processors.
Cache misses or excessive data movement between registers and memory can greatly slow down an FFT computation. Efficient programs such as the FFTW package are carefully designed to minimize these inefficiencies. In-place algorithms reuse the data memory throughout the transform, which can reduce cache misses for longer lengths.
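As a rough illustration of an in-place transform, the following sketch implements a radix-2 decimation-in-time FFT that overwrites its input list and reads its twiddle factors from a table computed once beforehand. The function names `make_twiddles` and `fft_in_place` are chosen here for illustration only, and the code favors clarity over speed; it is not taken from FFTW or any other library.

```python
import cmath

def make_twiddles(n):
    """Precompute the n/2 twiddle factors exp(-2j*pi*k/n), k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

def fft_in_place(x, twiddles):
    """Radix-2 decimation-in-time FFT that overwrites the list x.

    Assumes len(x) is a power of two and twiddles == make_twiddles(len(x)).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Reorder the input into bit-reversed order, swapping within x itself.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # log2(n) stages of butterflies; each butterfly writes over its own inputs.
    half = 1
    while half < n:
        stride = n // (2 * half)  # step through the shared twiddle table
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = twiddles[k * stride]  # table lookup, no sin/cos call here
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
        half *= 2
    return x
```

The same twiddle table serves every stage because stage-level factors are a strided subset of the full-length ones, so no extra storage beyond the data array and the table is needed.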
FFTs of real-valued signals require only about half as many computations as FFTs of complex-valued data of the same length. There are several methods for achieving this reduction, which are described in more detail in Sorensen et al.
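One common way to obtain this saving, sketched below, is to pack a real sequence of length 2N into a complex sequence of length N, take a single length-N complex FFT, and then separate the result. This is only one of several possible methods and is not necessarily the one Sorensen et al. recommend; the helper names `fft` and `real_fft` are illustrative, and the input length is assumed to be a power of two.

```python
import cmath

def fft(z):
    """Plain recursive radix-2 complex FFT (length must be a power of two)."""
    n = len(z)
    if n == 1:
        return list(z)
    even, odd = fft(z[0::2]), fft(z[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def real_fft(x):
    """DFT bins 0..N of a real sequence of length 2N via one length-N complex FFT."""
    n = len(x) // 2
    # Even-indexed samples go in the real part, odd-indexed in the imaginary part.
    z = [complex(x[2 * i], x[2 * i + 1]) for i in range(n)]
    zf = fft(z)
    out = []
    for k in range(n + 1):
        a = zf[k % n]
        b = zf[(n - k) % n].conjugate()
        even = (a + b) / 2    # DFT of the even-indexed samples
        odd = (a - b) / 2j    # DFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / (2 * n)) * odd)
    return out
```

Only bins 0 through N are returned; the remaining bins of a real signal's DFT are the complex conjugates of these and need not be computed.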
Occasionally only certain DFT frequencies are needed, the input signal values are mostly zero, the signal is real-valued (as discussed above), or other special conditions exist for which faster algorithms can be developed. Sorensen and Burrus describe slightly faster algorithms for pruned or zero-padded data. Goertzel's algorithm is useful when only a few DFT outputs are needed. The running FFT can be faster when DFTs of highly overlapped blocks of data are needed, as in a spectrogram.
Higher-radix algorithms, such as the radix-4, radix-8, or split-radix FFTs, require fewer computations and can produce modest but worthwhile savings. Even the split-radix FFT reduces the multiplications by only 33% and the additions by a much lesser amount relative to the radix-2 FFTs; significant improvements in program speed are often due more to implicit loop-unrolling or other compiler benefits than to the computational reduction itself!
Bit-reversing the input or output data can consume several percent of the total run-time of an FFT program. Several fast bit-reversal algorithms have been developed that can reduce this to two percent or less, including the method published by D.M.W. Evans.
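For reference, a simple table-driven bit-reversal is sketched below. It is not the method published by Evans; it only shows the basic idea of computing the permutation once and reusing it for every transform of the same length. The function names are illustrative, and the length is assumed to be a power of two.

```python
def bit_reverse_table(n):
    """Return rev, where rev[i] is i with its log2(n) bits reversed."""
    bits = n.bit_length() - 1
    rev = [0] * n
    for i in range(1, n):
        # Reuse the reversal of i >> 1 and bring the low bit of i to the top.
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (bits - 1))
    return rev

def bit_reverse_in_place(x, rev):
    """Permute the list x into bit-reversed order using a precomputed table."""
    for i, j in enumerate(rev):
        if i < j:  # swap each pair only once
            x[i], x[j] = x[j], x[i]
    return x
```

For a length of 8, `bit_reverse_table(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`.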
When FFTs first became widely used, hardware multipliers were relatively rare on digital computers, and multiplications generally required many more cycles than additions. Methods to reduce multiplications, even at the expense of a substantial increase in additions, were often beneficial. The prime factor algorithms and the Winograd Fourier transform algorithms, which required fewer multiplies and considerably more additions than the power-of-two-length algorithms, were developed during this period. Current processors generally have high-speed pipelined hardware multipliers, so trading multiplies for additions is often no longer beneficial. In particular, most machines now support single-cycle multiply-accumulate (MAC) operations, so balancing the number of multiplies and adds and combining them into single-cycle MACs generally results in the fastest code. Thus, the prime-factor and Winograd FFTs are rarely used today unless the application requires FFTs of a specific length.
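It is possible to implement a complex multiply with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds. The usual form is

$$(C + jS)(X + jY) = (CX - SY) + j(CY + SX),$$

and the same result can be obtained from

$$Z = C(X - Y), \qquad D = C + S, \qquad E = C - S,$$

$$CX - SY = EY + Z, \qquad CY + SX = DX - Z.$$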
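As a hedged illustration of the three-multiply form, the sketch below writes the twiddle factor as c + j·s and the data value as x + j·y; these names and the two function names are chosen here for illustration. Because the sum c + s and the difference c - s depend only on the twiddle factor, they can be stored in the lookup table, which leaves three multiplies and three adds at run time.

```python
def twiddle_constants(c, s):
    """Precompute (c, c + s, c - s) for a twiddle factor c + j*s."""
    return c, c + s, c - s

def complex_mul_3(consts, x, y):
    """Return (real, imag) of (c + j*s) * (x + j*y) using three real multiplies."""
    c, d, e = consts       # d = c + s, e = c - s
    z = c * (x - y)        # multiply 1
    real = e * y + z       # multiply 2: equals c*x - s*y
    imag = d * x - z       # multiply 3: equals c*y + s*x
    return real, imag
```

On machines with single-cycle multiply-accumulate instructions the conventional four-multiply form may well be just as fast, so this rearrangement is worth measuring rather than assuming.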
Certain twiddle factors,
namely
When optimizing FFTs for speed, it can be important to maintain perspective on the benefits that can be expected from any given optimization. The following list categorizes the various techniques by potential benefit; these will be somewhat situation- and machine-dependent, but clearly one should begin with the most significant and put the most effort where the pay-off is likely to be largest.