CUFFT vs FFTW comparison

This benchmark was done in the same fashion as benchFFT, comparing complex-to-complex single-precision FFT speeds between FFTW and the CUDA CUFFT library. Transform sizes were limited by the amount of device memory on the GPU. "mflops" is a derived quantity, not a measured flop count; see the benchFFT page for details. The peak performance reported by NVIDIA as of Nov. 2007 for a GPU (8800 GTX?) doing FFTs on the device only (no memory transfer from the host) is 52 GFLOPS.

The test machine is a dual-socket, dual-core Intel(R) Xeon(R) CPU 5150 @ 2.66GHz, with an Intel 5000X chipset and an NVIDIA Quadro FX 4600 GPU in a PCI Express 1.0 x16 slot. Tests were done with fftw-3.1.2 compiled with SSE optimizations, for both serial and threaded (4 threads) transforms, using the FFTW_MEASURE method to generate plans. The CUFFT 1.1 library and CUDA 1.1 toolkit were used on the card. Timings are given both for the raw device speed and with the overhead of copying memory to and from the GPU included.

Code

The code can be found here.

Overview of results

- For small FFTs (<8192 elements), CUFFT is much slower than FFTW, even serial FFTW. This is likely because there is little work per transform and the transform fits in the CPU cache.
- Batching (a plan with multiple transforms) results in a much better speedup on the GPU. I expect this is where the 52 GFLOPS number comes from (see the gpgpu.org forum).
- For larger FFTs, CUFFT gives up to a 5x speedup over threaded FFTW (up to a 10x speedup over serial FFTW).
- With memory transfers included, CUFFT is <2x faster than threaded FFTW, and is actually slower for 3D transforms.
- For non-power-of-2 FFTs, CUFFT is at most 2x faster than threaded FFTW.
- It doesn't make much sense to use the GPU purely as an FFT accelerator; the arithmetic intensity of the processing done on the GPU has to be increased to make it useful.

Plots

[Plots: four each for 1 dimensional, 2 dimensional, and 3 dimensional transforms]
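
The "mflops" figure mentioned in the introduction is derived from the transform size and the measured time rather than from a counted number of operations. As a rough sketch, assuming the benchFFT convention of 5 N log2(N) divided by the time in microseconds for a complex transform (2.5 N log2(N) for a real one), with N the total number of points across all dimensions, the conversion could look like the following. The function name `bench_mflops` and the timings used are hypothetical and are not taken from the benchmark code or its results.

```python
import math


def bench_mflops(dims, seconds, real=False):
    """Derived "mflops" for one FFT, following the benchFFT convention.

    dims    -- iterable of transform dimensions, e.g. (8192,) or (64, 64, 64)
    seconds -- measured wall time for ONE transform, in seconds
    real    -- True for a real-data transform (uses 2.5 instead of 5)

    This is a scaled speed, not a count of operations actually executed.
    """
    n_total = math.prod(dims)  # N is the total number of points
    if n_total < 2:
        raise ValueError("transform must have at least 2 points")
    if seconds <= 0:
        raise ValueError("time must be positive")
    scale = 2.5 if real else 5.0
    microseconds = seconds * 1e6
    return scale * n_total * math.log2(n_total) / microseconds


if __name__ == "__main__":
    # Hypothetical timing: a 1D complex FFT of 8192 points taking 100 us.
    print(f"{bench_mflops((8192,), 100e-6):.1f}")  # prints 5324.8
```

Because the same formula is applied to every library, the resulting numbers are comparable between FFTW and CUFFT even though the two may perform different numbers of real operations. Applying it to a time that includes the host-to-device and device-to-host copies gives the "with memory transfer" rate instead of the raw device rate.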