Q & A : Benchmarks : OpenCL GPGPU Performance (OpenCL vs. CUDA/STREAM)

Note: This article addresses GPGPU OpenCL performance. For CPU performance, please see OpenCL CPU Performance.

What is OpenCL?

OpenCL is an open standard for running (parallel) tasks on CPUs, (GP)GPUs and hardware accelerators, unlike the proprietary solutions from graphics manufacturers (e.g. CUDA, STREAM). It is cross-platform and thus supported on many operating systems (e.g. OS X “Snow Leopard”, Linux), unlike Microsoft’s Windows-only DirectX 11 CS (Compute Shader/DirectCompute), a similar standard. It also supports various embedded and hand-held devices.

The standard was recently (May 2009) ratified by the Khronos Group and we are beginning to see support from graphics manufacturers as well as processor manufacturers¹.

Why do we measure its performance?

We believe OpenCL will become the standard for programming GPGPUs in spite of the existing proprietary solutions; DirectX 11 CS will probably be preferred by Windows developers who are heavily invested in DirectX and do not target other operating systems.

We have ported all our GPGPU benchmarks to OpenCL and will continue to support new OpenCL implementations as they become available. Together with the existing GPGPU (CUDA, STREAM, etc.) and CPU benchmarks, users can easily measure the performance of the various solutions available today.

OpenCL support has been released with Sandra 2009 SP4, while additional benchmarks will be released in a future version of Sandra.

What do the results mean?

  1. The arithmetic results are reported in pixels/s, i.e. how many pixels can be computed per second.
  2. Higher is better: a higher pixels/s result means better performance.
  3. The memory results are reported in MB/s, i.e. how many MB can be transferred per second.
  4. Higher is better: a higher MB/s result means better performance.
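As a concrete illustration of how such an index is formed, here is a minimal sketch with made-up numbers (this shows the general idea of an items-per-second index, not Sandra's actual methodology):

```python
def throughput(items, seconds):
    # generic benchmark index: items processed per second; higher is better
    return items / seconds

# hypothetical run: a 4096x4096 image processed 8 times in 0.25 s
pixels = 4096 * 4096 * 8
print(f"{throughput(pixels, 0.25) / 1e6:.1f} MPixel/s")

# hypothetical run: 512 MB copied to the device in 0.1 s
print(f"{throughput(512, 0.1):.0f} MB/s")
```

The same function serves both metrics; only the unit of "items" (pixels vs. megabytes) differs.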

Typical Arithmetic Results

Testing the single-precision floating-point (32-bit float) performance of various GPGPUs in OpenCL as well as in the proprietary solutions (CUDA, STREAM) reveals quite interesting results.

Environment: Windows Vista x64 SP2; Catalyst 9.11 video / STREAM 1.4.427 / OpenCL 1.0 Beta 4; ForceWare 190.89 video / CUDA 2.3 / OpenCL 1.0 live release.

GPU Name | Cores / Speed / Memory | Native Float / Double Performance | OpenCL Float / Double Performance | Comments
ATI Radeon HD 4850 | 800 / 625MHz / 512MB | 359.7 / 177.7 MPixel/s (CAL) | 508.7 / 25.5³ MPixel/s | OpenCL achieves even better performance than CAL, about 50% faster, which is just incredible!
ATI Radeon HD 5870 | 1600 / GHz / 1GB | 912 / 459.478 MPixel/s (CAL) | 1588 / 69.509³ MPixel/s | We see again ~50% gains in OpenCL versus CAL, the compiler doing better than hand-optimised code. Fantastic result!
nVidia GeForce 9600 GT | 64 / 1.6GHz / 512MB | 164.381 / 12.471² MPixel/s (CUDA) | 153.104 / 11.783² MPixel/s | Tie: CUDA 7% faster on the float test, 6% slower on double emulation; a great result for OpenCL.
nVidia GeForce 9500 GT | 32 / 1.4GHz / 512MB | 74.319 / 5.807² MPixel/s (CUDA) | 70.754 / 5.488² MPixel/s | Tie: CUDA 5% faster on the float test, 5% slower on double emulation; a great result for OpenCL.
nVidia ION | 16 / 1.1GHz / 128MB | 31.897 / 2.332² MPixel/s (CUDA) | 30.250 / 2.208² MPixel/s | Tie: CUDA 5% faster on the float test, 5% slower on double emulation; a great result for OpenCL.
nVidia GeForce 9400M GS | 16 / 800MHz / 128MB | 22.607 / 1.720² MPixel/s (CUDA) | 18.9 / 1.589² MPixel/s | CUDA slightly faster on both, though only by ~15%; should improve with newer drivers.

2: Emulated results (through 32-bit floats) due to lack of native double-precision (64-bit) floating-point support in the tested hardware.

3: Emulated results (through 32-bit floats) due to lack of native double-precision (64-bit) floating-point support in the OpenCL drivers.

While CUDA was up to 2x faster than the preview OpenCL drivers/run-time, the new conformant release drivers make a HUGE difference: we now have performance parity with CUDA (in some cases OpenCL is even faster), which is just great! Unless you need a CUDA-specific feature, there is no reason not to port your code to OpenCL now.

The Beta 4 OpenCL is showing incredible results, 50% faster than native CAL/STREAM, showing how writing high-level shaders and letting the compiler optimise the code can be faster than writing shaders by hand! While you should be able to get the same performance through CAL, you would need to be pretty skilled at writing low-level shaders and maintain a relatively large piece of code, whereas the OpenCL version is far simpler, allowing you to concentrate on parallelism.

There is no native double floating-point support yet, but if you can live with emulation, performance is good (even a little faster than CUDA/STREAM); native support should come soon and should be just as fast as through the proprietary solutions.
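The emulation footnoted above is typically done with "double-single" arithmetic, where a value is held as the unevaluated sum of two 32-bit floats. A minimal sketch of the idea (error-free addition per Knuth/Dekker, with Python's struct module used to force 32-bit rounding; this illustrates the general technique, not Sandra's actual kernels):

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    # Knuth's error-free transformation: s + e == a + b exactly (a, b are f32)
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_add(x, y):
    # add two double-single numbers, each a (hi, lo) pair of 32-bit floats
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)

# 1 + 2^-30 is not representable in a single 32-bit float,
# but the (hi, lo) pair preserves it
hi, lo = ds_add((1.0, 0.0), (2.0**-30, 0.0))
print(hi, lo)
```

Every operation costs several float instructions, which is why emulated double results in the table are an order of magnitude below the float results.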

Typical Memory Bandwidth Results

Testing the bandwidth performance of various current desktop processors and GPGPU-capable video adapters reveals quite interesting results.

Environment: Windows Vista x64 SP2; Catalyst 9.11 video / STREAM 1.4.427 / OpenCL 1.0 Beta 4; ForceWare 190.89 video / CUDA 2.3 / OpenCL 1.0 live release.

GPU Name | Cores / Speed / Memory | Native Internal / Transfer Bandwidth | OpenCL Internal / Transfer Bandwidth | Comments
ATI Radeon HD 4850 | 800 / 625MHz / 512MB | 37.9 / 2.6 GB/s (CAL) | 1.5 / 0.6 GB/s | Very slow OpenCL performance, most likely a bug in the current Beta 4 drivers.
ATI Radeon HD 5870 | 1600 / GHz / 1GB | 101.2 / 4.4 GB/s (CAL) | 15.2 / 2.6 GB/s | OpenCL internal transfers are very slow (about 1/8 of CAL), most likely a bug in the current Beta 4 drivers.
nVidia GeForce 9500 GT | 32 / 1.4GHz / 512MB | 12.8 / 5.4 GB/s (CUDA) | 12.3 / 5.8 GB/s | CUDA marginally faster, ~4%.
nVidia ION | 16 / 1.1GHz / 256MB | 6176 / 2104 MB/s (CUDA) | 6224 / 2277 MB/s | OpenCL marginally faster, ~1%.
nVidia GeForce 9400M GS | 16 / 800MHz / 128MB | 4765 / 1945 MB/s (CUDA) | 5384 / 1586 MB/s | OpenCL marginally faster, ~12%.

Memory transfers through OpenCL are just as fast, if not a little faster, than through CUDA; efficiency is greater than 50% of the theoretical hardware bandwidth, which is good news for programs using large data-sets. Again, there is no reason not to use OpenCL unless you rely on memory access patterns (e.g. zero-copy) that only CUDA makes possible.
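The efficiency figure can be checked with simple arithmetic; here is a sketch using hypothetical card specifications (the clock, bus width and DDR factor below are made-up illustrative values, not the specs of any card in the table):

```python
def theoretical_bandwidth(mem_clock_hz, bus_width_bits, transfers_per_clock=2):
    # peak memory bandwidth in bytes/s: clock x bus width (bytes) x DDR factor
    return mem_clock_hz * (bus_width_bits // 8) * transfers_per_clock

# hypothetical card: 800 MHz DDR memory on a 128-bit bus -> 25.6 GB/s peak
peak = theoretical_bandwidth(800e6, 128)
measured = 12.8e9  # hypothetical measured internal bandwidth
print(f"efficiency: {measured / peak:.0%}")
```

A measured figure at or above half of this peak is generally considered healthy for a real-world copy benchmark.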

The Beta 4 OpenCL driver does not do as well, with internal transfer performance at about 1/8 (~12%) of CAL, which is very low and most likely a bug in the current drivers. Hopefully the issue will be fixed by the time the release version is out.

Conclusion

The latest conformant release driver/OpenCL run-time brings performance parity with CUDA, a great improvement over the preview release. The public release (due soon) might even be faster than CUDA and can only get better. There is no reason not to port CUDA code to OpenCL now!

Converting CUDA code to OpenCL is not difficult, with the major bonus of being able to run on other GPUs; hardware manufacturers are releasing drivers that allow OpenCL code to run on CPUs or dedicated hardware accelerators, and major operating systems like OS X “Snow Leopard” are adding support for it. Just like Java, it is a case of “write once, test everywhere”. Porting to DirectX 11 CS is far harder unless you are already using DirectX 10 and a GPGPU methodology that executes shaders as kernels.

We see little reason to use proprietary frameworks like CUDA or STREAM once public drivers supporting OpenCL are released – unless there are features your code depends on that are not included yet; even then, they will most likely be available as extensions (similar to OpenGL) pretty soon.

1: For CPU performance, please see OpenCL CPU Performance.
