Data is operated on in the spectral domain, often transformed from either a temporal or spatial domain. During a transformation, spectral methods typically use multiple stages, where the dependencies within a stage for a set of butterfly patterns (eg. Cooley-Tukey Fast Fourier Transform or the many variants of Wavelet transforms). Each butterfly operation has two inputs and two outputs (each typically a complex number) and perform a set of multiply and add operations. The stages in the algorithm can be refactored and reordered to improve locality, but these variants require one or more steps of global data reordering that can be viewed as a transpose. The canonical transform is implemented as radix-2 which limits the transformation sizes to power of two sized data arrays that requires some care to avoid cache-line aliasing effects, but higher-radix implementations exist for prime numbered radixes (eg. radix 2, 3, 5, or 7). Multidimensional transformations are typically composed as the sequential application of 1D transforms for each of the principle dimensions of the calculation.
The code is highly vectorizable using unit-stride and constant-stride addressing, although efficiency is greater when multiple transforms are performed simultaneously or the problem is large enough that it can be broken into several simultaneous smaller ones. Some variants exploit scatter-gather to reorder elements. The code has better temporal locality than either structured or unstructured grids, e.g., an FFT is an O(n log n) operation on n values. These methods also have some spatial locality but can require large strided accesses for some forms. The addressing pattern is regular and statically determinable and so hardware and software prefetching can be effective, although the strides will change between stages. Power-of-2 strides are also common, but they can be avoided with careful programming if they perform poorly.
Although a parallel 1D spectral transform does form the basis for the HPC Challenge FFTE benchmark problem, partitioning a 1D FFT across multiple processors is rarely if ever required in practice. Normally, a multidimensional transform is partitioned across multiple processors such that a number of 1D transformations are performed locally on each processor. The algorithms are refactored so that most stages are computed using data local to a node, with a few stages requiring global all-all communication for a transpose operation. For example, a 2D FFT problem is decomposed into two stages of 1D FFTs – one performed in the X direction followed by another performed in the Y direction. For a distributed memory machine, the data is partitioned so that each row of data in the X-dimension is contiguous for each processor. Following the 1D FFTs in the X direction, the data must be explicitly transposed so that the columns will be contiguous on each processor for the Y-dimension of the transpose.
Original text by Krste Asanovic. Additions by Kathy Yelick, John Shalf, and Ras Bodik.