Benchmarking : OpenCL CPU Performance (OpenCL vs native/Java/.Net)

Note: This article addresses CPU OpenCL performance. For GPGPU performance, please see OpenCL GPGPU Performance.

What is OpenCL?

OpenCL is an open standard for running (parallel) tasks on CPUs, (GP)GPUs and hardware accelerators. It is cross-platform and thus supported on many operating systems (e.g. OS X “Snow Leopard”, Linux, etc.). While current CPUs are not its primary target (many parallel frameworks already exist for them), the same code can run on multiple CPUs, GPUs or a combination of the two, allowing a flexibility and scalability that cannot easily be duplicated.
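To make the data-parallel model concrete, here is a minimal sketch in plain Python (no OpenCL runtime assumed; all names are purely illustrative): every work-item runs the same kernel body on one element of the problem domain, identified by its global id, and the runtime is free to map work-items to CPU cores, GPU lanes, or both.

```python
# Minimal illustration of OpenCL's data-parallel model (not the real API).
# A "kernel" is one function applied at every index of the problem domain.

def saxpy_kernel(gid, a, x, y, out):
    # Body a real OpenCL kernel would express in OpenCL C:
    #   out[gid] = a * x[gid] + y[gid];
    out[gid] = a * x[gid] + y[gid]

def enqueue_nd_range(kernel, global_size, *args):
    # Stand-in for the runtime's dispatch (clEnqueueNDRangeKernel):
    # one work-item per index; a real runtime runs these in parallel.
    for gid in range(global_size):
        kernel(gid, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
enqueue_nd_range(saxpy_kernel, 4, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

The key point is that the kernel itself says nothing about the device; the same source can be dispatched to a CPU, a GPU, or split across both.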

As future GPUs shed almost all fixed-function graphics hardware and CPUs become increasingly vectorised, the differences between the two will shrink; OpenCL brings that unification today.

The standard was recently (May 2009) ratified by The Khronos Group, and we are beginning to see support from processor manufacturers as well.

Why do we measure its performance?

While we believe OpenCL will become the standard for programming GPGPUs, processor manufacturers have also released frameworks that allow OpenCL code to run on CPUs; this lets systems without dedicated GPUs run OpenCL code on the CPU.

We have ported all our GPGPU benchmarks to OpenCL and will continue to support new OpenCL implementations as they become available. Together with the existing CPU benchmarks, users can easily measure the performance of the various solutions available today.

OpenCL support has been released with Sandra 2009 SP4, while additional benchmarks will follow in a future version of Sandra.

What do the results mean?

  1. The arithmetic results are in pixels/s, i.e. how many pixels can be computed in 1 second.
  2. The memory results are in MB/s, i.e. how many MB can be transferred in 1 second.
  3. In all cases, higher results (pixels/s or MB/s) mean better performance.
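One caveat when reading the tables: the OpenCL column is reported in kPixel/s while the other columns use MPixel/s, so results must be converted to a common unit before comparing. A quick sanity check, using the rounded Phenom II float figures from the table below as inputs:

```python
# Convert kPixel/s to MPixel/s so columns can be compared directly.
def kpixels_to_mpixels(kpix):
    return kpix / 1000.0

opencl = kpixels_to_mpixels(3825)  # 3825 kPixel/s -> 3.825 MPixel/s
sse2 = 47.9                        # MPixel/s, from the SSE2 column
ratio = sse2 / opencl              # how many times faster SSE2 is
print(round(opencl, 3), round(ratio, 1))  # 3.825 12.5
```

The ~12.5x ratio from these rounded figures matches the ~12.6x slowdown quoted in the table's comments.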

Typical Arithmetic Results

Testing the single floating-point (32-bit float) performance of various CPUs in OpenCL and various frameworks (native, virtual machine run-times like Java, .Net) reveals quite interesting results.

Environment: Windows Vista SP2, Server 2003 SP2, AMD CPU OpenCL 1.0 preview.

| CPU | Cores / Speed / Memory | SSE2 Float / Double | .Net Float / Double | Java Float / Double | OpenCL Float / Double | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| AMD Phenom II X4 925 | 4 / 2.8GHz / 2GB | 47.9 / 26.2 MPixel/s | 12.8 / 3.1 MPixel/s | 9.88 / 9 MPixel/s | 3825 / 3932 kPixel/s | 2.6-3.4 times slower than un-vectorised .Net/Java, and 12.6 times slower than vectorised SSE2. Not a great result considering non-AMD CPUs take less of a hit. |
| AMD Phenom 9750 | 4 / 2.4GHz / 2GB | 41.2 / 22.5 MPixel/s | 11 / 2.6 MPixel/s | 8 / 8.11 MPixel/s | 3075 / 2412 kPixel/s | 2.66-3.66 times slower than .Net/Java and 13.7 times slower than SSE2. The results show that Phenom II can perform better than the original – another reason to consider upgrading. |
| Intel Atom 230 | 1-2T / 1.6GHz / 2GB | 6.62 / 1.24 MPixel/s | 0.7 / 0.233 MPixel/s | 1.27 / 1.2 MPixel/s | 717 / 592 kPixel/s | .Net takes a hit on Atom, allowing OpenCL parity with it, though Java is 2x faster and SSE2 ~9x faster. Hyper-Threading does help here, thanks to the simpler processor architecture that cannot otherwise keep all its units utilised. |
| Intel Core 2 QX9650 | 4 / 3GHz / 2GB | 62.4 / 31.2 MPixel/s | 14 / 5 MPixel/s | 11.3 / 10.7 MPixel/s | 5144 / 4542 kPixel/s | 2.2-2.7 times slower than un-vectorised .Net/Java, and 12.2 times slower than vectorised SSE2 – the best result. |
| Intel Core i7 965 | 4-8T / 3.2GHz / 3GB | 120.5 / 62 MPixel/s | 28 / 8.3 MPixel/s | 23.3 / 22.4 MPixel/s | 8319 / 8832 kPixel/s | 2.8-3.4 times slower than .Net/Java, and 14.5 times slower than SSE2! Hyper-Threading does not seem to help much, though that is not a concern for AMD. |

2: Double (64-bit) results are emulated (through 32-bit float) due to lack of native double floating-point support in current OpenCL drivers.

Firstly, AMD deserves a lot of credit for being first with an OpenCL framework for CPUs; the GPGPU version, however, is still to come. Other manufacturers should release OpenCL frameworks soon, and we will update the review when they do.

We have to ignore double performance due to lack of support; with emulated performance being some 10 times slower, it is not an option at this time. Once support is added, performance should be similar to that of float code.
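For the curious, double emulation on top of floats typically uses "double-single" arithmetic: a value is carried as the unevaluated sum of two 32-bit floats, and each operation costs several float instructions – hence the order-of-magnitude slowdown. A minimal sketch of the error-free addition step (Knuth's TwoSum) using NumPy's float32 to model 32-bit arithmetic:

```python
import numpy as np

def two_sum_f32(a, b):
    # Knuth's TwoSum: returns (s, e) where s = fl(a + b) in float32 and
    # s + e == a + b exactly - the building block of double-single emulation.
    a, b = np.float32(a), np.float32(b)
    s = np.float32(a + b)
    bv = np.float32(s - a)
    av = np.float32(s - bv)
    e = np.float32(np.float32(b - bv) + np.float32(a - av))
    return s, e

hi, lo = two_sum_f32(1.0, 2.0 ** -24)
print(float(hi))              # plain float32 rounds the sum to 1.0...
print(float(hi) + float(lo))  # ...but hi + lo recovers 1 + 2**-24 exactly
```

A plain float32 addition loses the small addend entirely; carrying the rounding error in a second float preserves it, at the cost of the extra instructions that make emulated doubles so slow.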

We also draw no conclusions from non-AMD systems, as AMD has probably neither tested nor optimised for them; they are still useful to see how the framework behaves on them. There does not appear to be any bias: the OpenCL framework is just as fast (or rather, slow) on non-AMD hardware.

Overall, OpenCL vectorised code is 2-4 times slower than un-vectorised Java/.Net code, which is quite a hit. Both Java and .Net have JITs (just-in-time compilers) and optimisers, and OpenCL also provides for a JIT and optimisers. Since OpenCL code can be written in vectorised fashion (as ours is), it should be easier for a compiler to use SIMD instructions (e.g. SSE2) and schedule several operations in parallel.

Alternatively, more than one thread could be executed at the same time – as the kernel is the same for all threads: e.g. 4 threads using float (4x 32-bit = one 128-bit SSE register), 2 threads using double (2x 64-bit = one 128-bit SSE register), etc.
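The packing arithmetic above generalises to any register width; a hypothetical helper shows how many scalar lanes – and hence how many same-kernel threads – fit in one SIMD register:

```python
def lanes(register_bits, element_bits):
    # How many elements one SIMD register processes per instruction.
    return register_bits // element_bits

print(lanes(128, 32))  # SSE2 with float:  4 threads per instruction
print(lanes(128, 64))  # SSE2 with double: 2 threads per instruction
print(lanes(256, 32))  # future 256-bit AVX with float: 8
```

This is why wider registers (AVX and beyond) directly multiply the potential throughput of vectorised OpenCL code.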

Catching up with vectorised SSE2 code will be a challenge as we’re looking at 10 or more times slower at this time.

Typical Memory Bandwidth Results

Testing the bandwidth performance of various current desktop processors reveals quite interesting results. We have tested both single and multi-threaded performance.

Environment: Windows Vista SP2, Server 2003 SP2, AMD CPU OpenCL 1.0 preview.

| Device | Cores / Speed / Memory | Native Bandwidth | Multi-Threaded Native Bandwidth | OpenCL Bandwidth | Comments |
| --- | --- | --- | --- | --- | --- |
| AMD 790FX; AMD Phenom II 925; 2x 1GB PC2-8500 DDR2 (1066MHz) | 4 / 2.8GHz / 2GB | 6.88 GB/s | 13 GB/s | 4554 MB/s | 52% slower than native is acceptable, though not as good as the i7 result. |
| AMD RS780; AMD Phenom 9750; 2x 1GB PC2-6400 DDR2 (800MHz) | 4 / 2.4GHz / 2GB | 3.62 GB/s | 8.15 GB/s | 2164 MB/s | 57% slower than native is still acceptable; the result shows the improved memory controller of Phenom II. |
| nVidia Ion; Intel Atom 230; 2x 1GB PC2-6400 DDR2 (800MHz) | 1-2T / 1.6GHz / 2GB | 2.4 GB/s | n/a | 1646 MB/s | 66% slower is somewhat less than acceptable (50% or less), though AMD is not likely going to be concerned by performance on non-AMD hardware. |
| Intel X38; Intel QX9650; 2x 2GB PC3-10666 DDR3 (1333MHz) | 4 / 3GHz / 2GB | 7 GB/s | 7.3 GB/s | 4227 MB/s | 67% slower is somewhat less than acceptable (50% or less), though AMD is not likely going to be concerned by performance on non-AMD hardware. |
| Intel X58; Intel i7 965; 3x 1GB PC2-8500 DDR3 (1066MHz) | 4-8T / 3.2GHz / 3GB | 11.4 GB/s | 18.3 GB/s | 9110 MB/s | Just 25% slower than native – the best result of them all. |

We are ignoring multi-threaded performance as memory transfers do not appear to be threaded; while modern multi-core processors require threading to obtain the best performance, it is early days to expect it. Compare the multi-threaded native to single-threaded native results to see just how much CPUs with integrated memory controllers (AMD Athlon/Phenom/Opteron, Intel i5/i7, etc.) benefit from multi-threaded memory transfers.

We are also ignoring non-AMD systems as they are probably not tested or optimised by AMD but they are useful to see how the framework behaves on them; there does not appear to be any bias here either, with some non-AMD hardware performing better.

Overall, OpenCL is between 25-67% slower than native transfers, which is acceptable; only a little improvement is needed to bring it up to par. We would like to see multi-threaded transfers on CPUs with integrated memory controllers.
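For reference, raw copy bandwidth of the kind reported above can be estimated with a simple timed memory copy. A minimal single-threaded sketch (absolute numbers will vary with buffer size, caching and hardware, so treat it as illustrative only):

```python
import time

def copy_bandwidth_mb_s(size_mb=64, repeats=5):
    # Time repeated copies of a buffer and report the best MB/s observed;
    # taking the best run reduces noise from scheduling and page faults.
    src = bytearray(size_mb * 1024 * 1024)
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full copy of the buffer
        dt = time.perf_counter() - t0
        best = max(best, size_mb / dt)
    assert len(dst) == len(src)
    return best

print(round(copy_bandwidth_mb_s(), 1), "MB/s")
```

A real benchmark would also pin threads, vary the buffer size past the caches, and use streaming stores; this sketch only shows the basic measurement shape.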

Conclusion

Double floating-point is not an option at this time due to lack of support, with emulation being too slow to be usable.

Float (32-bit) OpenCL vectorised performance is some 2-4 times slower than un-vectorised .Net/Java, which is acceptable, but over 10 times slower than vectorised SSE2 code. If you already have .Net or Java ports of your algorithms and you’re happy with their performance, an OpenCL port should be acceptable. With better compilers it should only get faster: OpenCL forces you to structure code and data for SIMD, so the compiler/optimiser should be able to extract more parallelism from the code.

If performance is required, there is no substitute at this time for using SIMD (e.g. intrinsics) and structuring data to suit. Future AVX/FMA instructions and wider (256-bit) registers allow even more gains for SIMD code – and future OpenCL compilers will also take advantage of this for better performance.

Let’s not forget that the same code (with minor modifications) can also run on a GPGPU – which is built into even modern integrated graphics chipsets. You can choose at run-time which device(s) to use, or even use both types; no other programming framework allows you to do this.

We’re pretty excited by OpenCL and cannot wait for the combined CPU/GPGPU frameworks from AMD and other vendors.

1: For GPGPU performance, please see OpenCL GPGPU Performance.

