NEC

MathKeisan User's Guide


FFT and PARFFT

Introduction

FFT is a library which performs Fast Fourier Transforms. The FFT is an algorithm for calculating Discrete Fourier Transforms (DFT). The DFT of the $ n $ complex numbers $ (x_{j} , j = 0, 1, \ldots, n-1) $ is the $ n $ complex numbers $ (y_{k}, k = 0, 1, \ldots, n-1) $, given by

\[
y_k = \sum_{j=0}^{n-1} x_j e^{\frac{-2 \pi i}{n} jk} 
\; \; \; \; \; \;  k = 0, 1, \ldots, n-1 
\]

In this document, $ i^{2} = -1 $. The reverse transform is given by

\[
x_j = \frac{1}{n} \sum_{k=0}^{n-1} y_k e^{\frac{ 2 \pi i }{n} jk} 
\; \; \; \; \;  j = 0, 1, \ldots, n-1
\]

In the above equations, there is no scaling on the forward transform, and the reverse transform is scaled by $ 1/n $. An alternative and equally valid convention is to scale the forward transform by $ 1/n $ and have no scaling on the reverse transform, or to scale both transforms by $ 1/\sqrt{n} $.

The FFT in MathKeisan for Itanium® Processor Family (IPF) are from the Intel® Math Kernel Library (MKL). They have the MKL FFT and DFT interfaces. For more information, see the MKL Manual. The default location is /opt/MathKeisan/MKL/doc/mklman.pdf, or the relative path ../../../MKL/doc/mklman.pdf.

The remainder of this chapter only pertains to the FFTs in MathKeisan for SX.

User Interface

Software packages for FFTs differ greatly in functionality, and there is no standard user interface. The SX MathKeisan FFTs contain the same functionality as Cray® LibSci™ 3.1 and HP MLIB VECLIB [3]. The interfaces are the same except that the size of work arrays in MathKeisan LibSci™ differ from those in Cray's LibSci™. The FFT Man Pages tell how to compute work array sizes.

MathKeisan FFT supports these types of transforms:

They are all available for these data type mappings:

User interface information is available from several sources:

PARFFT

PARFFT is an OpenMP parallel version of FFT. It has the same user interface as FFT, so any code that is linked to FFT can alternatively be linked to PARFFT. If the environment variable OMP_NUM_THREADS is set to np, then PARFFT will run on np threads. If OMP_NUM_THREADS is unset, PARFFT will run on mp threads, where mp is the maximum number of processes in the resource group or node.

A Brief Sketch of the FFT Algorithm

If we let

\[
\omega^k_n = e^{\frac{-2 \pi i}{n}k} = \cos \frac{2 \pi }{n}k
- i \sin\frac{2 \pi }{n}k
\]

then the DFT is given by the matrix-vector multiplication:

\[
\left[ \begin{array}{lllcl}
\omega_n^0 & \omega_n^0 & \omega_n^0  & \cdots &  \omega_n^0       \\
\omega_n^0 & \omega_n^1 & \omega_n^2  & \cdots &  \omega_n^{n-1}   \\
\omega_n^0 & \omega_n^2 & \omega_n^4  & \cdots &  \omega_n^{2(n-1)}   \\
\vdots     & \vdots     & \vdots      &        &  \vdots              \\
\omega_n^0 & \omega_n^{(n-1)} & \omega_n^{2(n-1)}  & \cdots & \omega_n^{(n-1)(n-1)}   \\
\end{array} \right]
\left[ \begin{array}{c}
x_0     \\
x_1     \\
x_3     \\
\vdots  \\
x_{n-1} \\
\end{array}\right]
=
\left[ \begin{array}{c}
y_0     \\
y_1     \\
y_3     \\
\vdots  \\
y_{n-1} \\
\end{array}\right]
\]

If the calculation of the powers of $ \omega_{n} $ is ignored, then the floating-point operation count for the DFT is $ 8n^{2} $.

In 1965 Cooley and Tukey [1] developed the FFT algorithm to calculate the DFT. When $ n $ is a power of 2, the FFT algorithm has the greatly reduced operation count of $ 5n \lg n $. For general $ n $, the operational count is $ O(n \lg n) $. Unknown to Cooley and Tukey at the time, the FFT algorithm was proposed earlier by Gauss in 1805, and published posthumously in 1866 [2].

A derivation of the Cooley-Tukey algorithm is shown in Van Loan [4]. To get an idea of the algorithm, consider the case where $ n $ is a power of 2. Start by substituting $ \omega $ into the DFT equation to get:

\[
y_k = \sum^{n-1}_{j=0} x_j \omega_n^{jk}
\; \; \; \; \;  k = 0, 1, \ldots, n-1 
\]

Now set $ m = n/2 $. For $ j = 0, 1, \ldots, m-1 $, split $ x_{j} $ into even components $ \underline{x}_{j}=x_{2j} $ and odd components $  \overline{x}_{j}=x_{2j+1} $. Substituting into the equation above gives:

\[
y_k = \sum^{m-1}_{j=0} \underline{x}_j   \omega_n^{2jk} +
      \sum^{m-1}_{j=0}   \overline{x}_j \omega_n^{(2j+1)k}
\; \; \; \; \; \; \;  k = 0, 1, \ldots, n-1 
\]

Use the symmetry property $ \omega_{n}^{2jk} = \omega_{m}^{ jk} $ to get:

\[
y_k = \sum^{m-1}_{j=0} \underline{x}_j   \omega_{m}^{jk} + \omega_n^{k}
      \sum^{m-1}_{j=0} \overline{x}_j \omega_{m}^{jk}
\; \; \; \; \;  k = 0, 1, \ldots, n-1 
\]

Next, split $ y_{k} $ into the first $ m $ entries and the second $ m $ entries as follows:

\begin{eqnarray}
y_k & = &
\sum^{m-1}_{j=0} \underline{x}_j \omega_{m}^{jk} + \omega_n^{k}
\sum^{m-1}_{j=0} \overline{x}_j \omega_{m}^{jk}
\; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; k = 0, 1, \ldots, m-1 \nonumber  \\
y_{k+m} & = &
\sum^{m-1}_{j=0} \underline{x}_j \omega_{m}^{j(k+m)} + \omega_n^{k+m}
\sum^{m-1}_{j=0} \overline{x}_j \omega_{m}^{j(k+m)}
\; \; \; \; \;   k = 0, 1, \ldots, m-1  \nonumber
\end{eqnarray}

For the $ y_{k+m} $ equation use the symmetry properties $ \omega_{m}^{jm} = 1 $ and $ \omega_{n}^{m} = -1 $ to get:

\begin{eqnarray}
y_k & = &
\sum^{m-1}_{j=0} \underline{x}_j \omega_{m}^{jk} + \omega_n^{k}
\sum^{m-1}_{j=0} \overline{x}_j \omega_{m}^{jk}
\; \; \; \; \;  k = 0, 1, \ldots, m-1 \nonumber  \\
y_{k+m} & = &
\sum^{m-1}_{j=0} \underline{x}_j \omega_{m}^{jk} - \omega_n^{k}
\sum^{m-1}_{j=0} \overline{x}_j \omega_{m}^{jk}
\; \; \; \; \;   k = 0, 1, \ldots, m-1 \nonumber
\end{eqnarray}

We have taken the original DFT of size $ n $ and an operation count $ 8n^{2} $ and replaced it with 2 DFTs of size $ m=n/2 $ and operation count $ 2n^{2} $. This can be continued for $ \lg n $ steps to replace the original DFT of length $ n $ by $ n $ DFTs of length 1. This is called a radix-2 FFT. The algorithm can be extended beyond powers of 2, to general $ n $. The splitting process above corresponds to the Decimation In Time (DIT) Cooley-Tukey algorithm. There are a total of 8 splittings. The splitting used in the MathKeisan FFT library is the Decimation In Frequency (DIF) Stockham algorithm. A derivation of this algorithm is shown in Van Loan [4]. The DIF Stockham algorithm has an operation count equal to that of Cooley-Tukey, but the order in which the operations are performed is different.

Multiple 1-d FFTs and Vector Length

The FFT calculation proceeds as a series of steps of radix $ n_{1}, n_{2}, \ldots, n_{r} $ . The $ n_{1}, n_{2}, \ldots, n_{r} $ are generally small prime numbers from the factorization $ n = n_{1} n_{2} \cdots n_{r} $.

The vector length during the calculation of the radix $ n_{1} $ step will be $ n_{1} $; at the radix $ n_{2} $ step it will be the product $ n_{1} n_{2} $; at the radix $ p $ step it will be:

\[ \prod_{j=1}^{p}n_j \]

If $ n $ is small, then the vector lengths in the FFT calculation will be small.

Some applications require a number of 1-d FFTs of the same length. If the length of the FFTs is $ n $, and the number of FFTs is $ h $, then the multiple 1-d FFT can be written:


\[
y_{kl} = \sum_{j=0}^{n-1} x_{jl} \omega_n^{jk}   
\; \; \; \; \; k = 0, 1, \ldots, n-1
\; \; \; \; \; l = 0, 1, \ldots, h-1
\]

Here it is understood that $ y_{kl} $ and $ x_{jl} $ are 2-dimensional arrays of complex numbers. In a multiple 1-d FFT, the vector length at the radix $ n_{p} $ step of the calculation will be:

\[ h \prod_{j=1}^{p}n_j \]

This larger vector length gives performance advantages for multiple 1-d over 1-d FFTs.

2-d and 3-d FFTs

The 2-d FFT of the array of complex numbers $ x_{jl} $, for $ j = 0, 1, \ldots, n-1 $ and $ l = 0, 1, \ldots, h-1 $ is the array of complex numbers $ y_{kp} $ given by

\begin{eqnarray}
y_{kp} & = & \sum_{l=0}^{h-1} \sum_{j=0}^{n-1} x_{jl} \omega_h^{lp}
\omega_n^{jk}     
\; \; \; \; \; k = 0, 1, \ldots, n-1  
\; \; \; \; \; p = 0, 1, \ldots, h-1 \nonumber \\
       & = & \sum_{l=0}^{h-1}        \omega_h^{lp} 
             \sum_{j=0}^{n-1} x_{jl} \omega_n^{jk}     
\; \; \; \; \; k = 0, 1, \ldots, n-1  
\; \; \; \; \; p = 0, 1, \ldots, h-1 \nonumber
\end{eqnarray}

Now write:

\[
z_{kl} = \sum_{j=0}^{n-1} x_{jl} \omega_n^{jk}     
\; \; \; \; \; k = 0, 1, \ldots, n-1  
\; \; \; \; \; l = 0, 1, \ldots, h-1 
\]

Then $ y_{kp} $ is given by

\[
y_{kp} = \sum_{l=0}^{h-1} z_{kl} \omega_h^{lp}
\; \; \; \; \; k = 0, 1, \ldots, n-1  
\; \; \; \; \; p = 0, 1, \ldots, h-1 
\]

Here the 2-d FFT is replaced by the two multiple 1-d FFTs of the equations for $ z_{kl} $ and $ y_{kp} $ above.

By similar methods the 3-d FFT can be replaced by three multiple 1-d FFTs. Internally in MathKeisan subroutines 2-d and 3-d FFTs are calculated by calls to multiple 1-d computational kernels. This is known as the row-column algorithm.

Real-to-Complex and Complex-to-Real FFTs

The real-to-complex FFT has symmetry properties which give about a two times improvement in operation count and memory usage over the complex-to-complex FFT. To show the symmetry properties, write the DFT in terms of the trigonometric functions $ \sin $ and $ \cos $ to get:

\[
y_k=\sum_{j=0}^{n-1} x_j \cos \frac{2\pi }{n} jk - i \sum_{j=0}^{n-1}
x_j \sin \frac{2\pi }{n} jk
\; \; \; \; \; k = 0, 1, \ldots, n-1  
\]

If $ x_{j} $ is real, then $ y_{k} $ is complex conjugate symmetric about $ k = n/2 $. This allows for storing only half of the values of $ y_{k} $. The value $ y_{0} $ is refered to as the DC frequency. If $ n $ is even, then the value $ y_{n/2} $ is refered to as the Nyquist frequency. By the identities $ \sin 0 = 0 $ and $ \sin \pi j = 0 $ it can be seen that $ y_{0} $ and $ y_{n/2} $ are real.

For the real-to-complex FFT with real input vector of size $ n $, to obtain the speed and memory advantages, the output vector should be complex with size $ n/2+1 $. For the complex to real FFTs if the input vector is complex with size $ n/2+1 $ , the output vector should be real with size $ n $ . The reason for the size $ n/2+1 $ rather than $ n/2 $ is because the redundant zero parts of the DC and Nyquist frequencies are stored. Some of the LibSci™ and VECLIB interfaces take advantage of space saving, and some do not. For interface information, see the man pages.

Effect of Leading Dimensions

The multiple 1-d, 2-d and 3-d FFT subroutines have data stored in two or three dimensional arrays. If the leading dimension of an array is a high powers of 2, bank conflicts will result in poor performance. It is best to set the leading dimension of the arrays to an odd numbers to avoid bank conflicts.

FFT Routine List

 ?  indicates prefix which must be filled with a combination of:
S = REAL(kind=4), D = REAL(kind=8), C = COMPLEX(kind=4), Z = COMPLEX(kind=8)
  Name Prefixes Description
LibSci ?FFT CC ZZ 1-d Complex-Complex FFT.
?FFT C Z 1-d Complex-Complex FFT.
?FFT2 C Z 1-d Complex-Complex FFT.
?FFT SC DZ 1-d Real-Complex FFT.
?FFT CS ZD 1-d Complex-Real FFT.
?FFT2 RC DZ 1-d Real-Complex FFT.
?FFT2 CR ZD 1-d Complex-Real FFT.
?FFT2D CC ZZ 2-d Complex-Complex FFT.
?FFT2D C Z 2-d Complex-Complex FFT.
?FFT2D SC DZ 2-d Real-Complex FFT.
?FFT2D CS ZD 2-d Complex-Real FFT.
?FFT3D CC ZZ 3-d Complex-Complex FFT.
?FFT3D C Z 3-d Complex-Complex FFT.
?FFT3D SC DZ 3-d Real-Complex FFT.
?FFT3D CS ZD 3-d Complex-Real FFT.
?FFTM CC ZZ Multiple 1-d Complex-Complex FFT.
M?FFT C Z Multiple 1-d Complex-Complex FFT.
?FFTMLT C Z Multiple 1-d Complex-Complex FFT.
?FFTM SC DZ Multiple 1-d Real-Complex FFT.
?FFTM CS ZD Multiple 1-d Complex-Real FFT.
?FFTMLT R D Multiple 1-d Real-Complex, Complex-Real FFT.
?FTFAX F D Generate FFT factors and Calculate Trig tables, Real.
?FTFAX C Z Generate FFT factors and Calculate Trig tables, Complex.
XERFFT   FFT error handler.
VECLIB ?1DFFT C Z 1-d Complex-Complex FFT, input/output Complex.
?1DFFT S D 1-d Complex-Complex FFT, input/output Real.
?RC1FT C Z 1-d Real-Complex, Complex-Real FFT, input/output Complex.
?RC1FT S D 1-d Real-Complex, Complex-Real FFT, input/output Real.
?2DFFT C Z 2-d Complex-Complex FFT, input/output Complex.
?2DFFT S D 2-d Complex-Complex FFT, input/output Real.
?RC2FT C Z 2-d Real-Complex, Complex-Real FFT, input/output Complex.
?RC2FT S D 2-d Real-Complex, Complex-Real FFT, input/output Real.
?3DFFT C Z 3-d Complex-Complex FFT, input/output Complex.
?3DFFT S D 3-d Complex-Complex FFT, input/output Real.
?RC3FT C Z 3-d Real-Complex, Complex-Real FFT, input/output Complex.
?RC3FT S D 3-d Real-Complex, Complex-Real FFT, input/output Real.
?FFTS C Z Multiple 1-d Complex-Complex FFT, input/output Complex.
?FFTS S D Multiple 1-d Complex-Complex FFT, input/output Real.
?RCFTS C Z Multiple 1-d Real-Complex, Complex-Real FFT, input/output Complex.
?RCFTS S D Multiple 1-d Real-Complex, Complex-Real FFT, input/output Real.

Further Reading

  1. J. Cooley, J. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp. 19 (1965), 297–301.
  2. C.F. Gauss, "Theoria interpolationis methodo nova tracta," Carl Friedrich Gauss Werke, Band3, Koniglichen Gesellschaft de Wissenschafen, Gotingen, 1866.
  3. MLIB User's Guide, vol 2 (follow the Documentation link).
  4. C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadelphia, 1992.