FP16 GPGPU Image Processing Performance & Quality

GPGPU Image Processing

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the IEEE 754 16-bit, lower-precision floating-point representation that has recently begun to be supported by GPGPUs for compute (e.g. Intel EV9+ Skylake GPU, nVidia Pascal), while CPU support is still limited to SIMD conversion only (F16C). It has been added to allow mobile devices (phones, tablets) to provide increased performance (and thus save power for fixed workloads) for a small drop in quality with normal 8-bpc (24-bpp) image and video.
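To make the precision trade-off concrete, here is a minimal sketch (Python, standard library only) that stores every possible 8-bit channel value as FP16 and converts it back. The helper names `to_fp16` and `roundtrip_channel` are ours, and normalising channels to the 0..1 range is an assumption about how a filter might hold pixels, not a description of the benchmark's kernels.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def roundtrip_channel(v: int) -> int:
    """Normalise an 8-bit channel to 0..1, store it as FP16, convert back."""
    return round(to_fp16(v / 255.0) * 255.0)

# Collect the 8-bit values that do not survive the FP16 round trip.
lost = [v for v in range(256) if roundtrip_channel(v) != v]
print(f"{len(lost)} of 256 channel values changed")
```

Under these assumptions storage alone loses nothing for 8-bpc data, so any quality difference has to come from arithmetic on intermediate values rather than from holding pixels in FP16.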

However, normal laptops and tablets with integrated graphics can also benefit from FP16 support in the same way, due to their relatively low graphics compute power and the need to save power given the limited battery capacity of thin and light formats.

In this article we’re investigating the performance differences vs. standard FP32 (aka “single”) and the resulting quality difference (if any) for mobile GPGPUs (Intel’s EV9/9.5 SKL/KBL).

Image Processing Performance & Quality

We are testing GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Image Filter
FP32/Single FP16/Half Comments
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  481  967 [+2x] We see a text-book 2x performance increase for no visible drop in quality.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  107  331 [+3.1x] Using FP16 yields over 3x performance increase but we do see a few more changed pixels though no visible difference.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  112  325 [+2.9x] Again almost 3x performance increase but no visible quality difference. Result!
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  107  323 [+3.1x] Again just over 3x performance increase but no visible quality difference.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s) 5.41  5.67 [+4%] No image difference at all but also almost no performance increase – a measly 4%.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  4.7  13.48 [+2.86x] We’re back to a 2.8x performance increase but with a few more differences than we’ve seen so far, though quality seems acceptable.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  1188  1210 [+2%] Due to random number generation using 64-bit integer processing the performance difference is minimal, and the picture quality is not acceptable.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s) 470  508 [+8%] Again due to Perlin noise generation we see almost no performance gain but a big drop in image quality – not worth it.
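
The bracketed gains in the table can be reproduced directly from the two throughput columns. The short sketch below does that arithmetic; the dictionary layout and the `speedup` helper are ours, and the figures are copied from the rows above.

```python
# (FP32, FP16) throughput in MPix/s, copied from the table above.
results = {
    "Blur 3x3": (481, 967),
    "Sharpen 5x5": (107, 331),
    "Motion-Blur 7x7": (112, 325),
    "Sobel 2*5x5": (107, 323),
    "Median 5x5": (5.41, 5.67),
    "Oil Painting": (4.7, 13.48),
    "Diffusion": (1188, 1210),
    "Marbling": (470, 508),
}

def speedup(fp32: float, fp16: float) -> float:
    """FP16 throughput expressed as a multiple of FP32 throughput."""
    return fp16 / fp32

for name, (fp32, fp16) in results.items():
    print(f"{name:16s} {speedup(fp32, fp16):.2f}x")
```

Expressing every row as a multiplier makes it easier to see that the filters fall into two groups: those near or above 2x and those that barely move.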

Other Image Processing relating Algorithms

Benchmark
FP16/Half FP32/Single FP64/Double Comments
GPGPU Science Benchmark GEMM OpenCL (GFLOPS)  178 [+50%]  118  35 Dropping to FP16 gives us 50% more performance, not as good as 2x but still a significant increase.
GPGPU Science Benchmark FFT OpenCL (GFLOPS)  34 [+70%]  20  5.4 With FFT we are now 70% faster, closer to the 100% promised.
GPGPU Science Benchmark N-Body OpenCL (GFLOPS)  297 [+49%]  199  35 Again we drop to “just” 50% faster with FP16 but still a great performance improvement.

Final Thoughts / Conclusions

For many image processing filters (Blur, Sharpen, Motion-Blur, Sobel/Edge-Detection, Oil Painting, etc.) we see a huge 2-3x performance increase – more than we had hoped for (2x) – with little or no image quality degradation; only Median/De-Noise barely gains (4%), though its output is unchanged. Thus FP16 support is very useful and should be used when supported.
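
A rough way to check "little or no image quality degradation" is to run the same filter at both precisions and count the pixels that differ. The sketch below does this for a 3×3 box blur on the CPU, rounding every intermediate to FP16 through the standard `struct` module. It is an illustration under stated assumptions (a synthetic test pattern, double precision standing in for FP32, helper names of our choosing), not the benchmark's OpenCL kernel.

```python
import struct

def fp16(x: float) -> float:
    """Round to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def blur3x3(img, rnd):
    """3x3 box blur of the interior pixels; `rnd` rounds each intermediate."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc = rnd(acc + rnd(img[y + dy][x + dx] / 255.0))
            row.append(round(rnd(acc / 9.0) * 255.0))
        out.append(row)
    return out

# Synthetic 16x16 test pattern (hypothetical data, not the benchmark image).
image = [[(x * 37 + y * 101) % 256 for x in range(16)] for y in range(16)]

reference = blur3x3(image, float)  # double precision stands in for FP32
half = blur3x3(image, fp16)        # every intermediate rounded to FP16

diffs = [abs(a - b) for ra, rb in zip(reference, half) for a, b in zip(ra, rb)]
changed = sum(1 for d in diffs if d)
print(f"{changed} of {len(diffs)} pixels changed, max difference {max(diffs)} level(s)")
```

For a small kernel like this the two results should differ by at most one 8-bit level, which fits the observation of a few changed pixels but no visible difference.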

However, for complex filters (Diffusion, Marbling/Perlin Noise) the drop in quality is not acceptable for such a minor performance increase (2-8%); promoting more data items from FP16 to FP32 to improve quality would further drop performance, making the whole endeavour pointless.
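
One plausible contributor to the quality loss in noise-style filters (our assumption; the measurements do not isolate the cause) is that FP16 resolution shrinks as values grow: with an 11-bit significand, the gap between neighbouring values doubles at every power of two. The sketch below prints that gap at a few magnitudes; the helper names are ours.

```python
import struct

def fp16(x: float) -> float:
    """Round to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def spacing(x: float) -> float:
    """Gap between x (rounded to FP16) and the next larger FP16 value.

    Valid for positive finite inputs below the largest FP16 value.
    """
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    nxt = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return nxt - fp16(x)

for value in (0.5, 1.0, 255.0, 1024.0, 2048.0, 4096.0):
    print(f"near {value:7.1f}: step {spacing(value)}")

# Above 2048 not every integer is representable any more.
print(fp16(2049.0))
```

Intermediate values such as large pixel coordinates or hash-like noise inputs would therefore lose low-order detail in FP16, whereas normalised colour values near 0..1 keep a much finer step.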

For those algorithms that do benefit, the performance improvement with FP16 is very much worth it.

Intel Graphics GPGPU Performance


Why test GPGPU performance of Intel Core Graphics?

Laptops (and tablets) are still in fashion, with desktops largely left to PC game enthusiasts and workstations for big compute workloads; most laptops (and all tablets) make do with integrated graphics, with the few dedicated graphics options aimed mainly at mobile PC gamers.

As a result, integrated graphics on Intel’s mobile platform is what the vast majority of users will experience – thus its importance is not to be underestimated. While in the past integrated graphics options were dire, the introduction of the Core v3 (Ivy Bridge) series brought us a GPGPU-capable graphics processor as well as an updated version of the internal media transcoder of Core v2 (Sandy Bridge).

With each generation Intel has progressively improved the graphics core, perhaps far more than its CPU cores – and added more variants (GT3) and embedded cache (eDRAM) which greatly increased performance – all within the same power limit.

New Features enabled by the latest 21.45 graphics driver

With Intel graphics drivers supporting just 2 generations of graphics – unlike the unified drivers of AMD and nVidia – old graphics quickly become obsolete, with few updates; but the Windows 10 “free update” forced Intel’s hand somewhat, with its driver (20.40) supporting 3 generations of graphics (Haswell, Broadwell and, the latest at the time, Skylake).

However, the latest 21.45 driver for newly released Kabylake and older Skylake does bring new features that can make a big difference in performance:

  • Native FP64 (64-bit aka “double” floating-point support) in OpenCL – thus allowing high precision compute on integrated graphics.
  • Native FP16 (16-bit aka “half” floating-point support) in OpenCL, ComputeShader – thus allowing lower precision but faster compute.
  • Vulkan graphics interface support – OpenGL’s successor and DirectX 12’s competitor – for faster graphics and compute.
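
In OpenCL, optional precisions are advertised through the device's extension string, where `cl_khr_fp16` and `cl_khr_fp64` are the Khronos extension names for half and double support. The sketch below only parses such a string; the two sample strings are hypothetical and the helper name is ours, so treat it as an illustration of the check rather than output from any real driver.

```python
def precision_support(extensions: str) -> dict:
    """Map each optional precision to whether its Khronos extension is listed.

    `extensions` is the space-separated string an OpenCL runtime reports
    for a device (the CL_DEVICE_EXTENSIONS query).
    """
    names = set(extensions.split())
    return {
        "fp16": "cl_khr_fp16" in names,
        "fp64": "cl_khr_fp64" in names,
    }

# Hypothetical extension strings, for illustration only.
older_driver = "cl_khr_byte_addressable_store cl_khr_icd"
newer_driver = "cl_khr_byte_addressable_store cl_khr_icd cl_khr_fp16 cl_khr_fp64"

print(precision_support(older_driver))
print(precision_support(newer_driver))
```

A kernel that wants to use `half` or `double` arithmetic would typically be selected (or compiled) only when the corresponding entry is true, falling back to FP32 otherwise.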

Will these new features make upgrading your laptop to a brand-new KBL laptop more compelling?

In this article we test (GP)GPU graphics unit performance.

Hardware Specifications

We are comparing the internal GPUs of the new Intel ULV APUs with the old versions.

Graphics Unit Haswell HD4000 Haswell HD5000 Broadwell HD6100 Skylake HD520 Skylake HD540 Kabylake HD620 Comment
Graphics Core EV7.5 HSW GT2U EV7.5 HSW GT3U EV8 BRW GT3U EV9 SKL GT2U EV9 SKL GT3eU EV9.5 KBL GT2U Despite 4 CPU generations we really have 2 GPU generations.
APU / Processor Core i5-4210U Core i7-4650U Core i7-5557U Core i7-6500U Core i5-6260U Core i3-7100U The naming convention has changed between generations.
Cores (CU) / Shaders (SP) / Type 20C / 160SP 40C / 320SP 48C / 384SP 24C / 192SP 48C / 384SP 23C / 184SP BRW increased CUs to 24/48 and i3 misses 1 core.
Speed (Min / Max / Turbo) MHz 200-1000 200-1100 300-1100 300-1000 300-950 300-1000 The turbo clocks have hardly changed between generations.
Power (TDP) W 15 15 28 15 15 15 Except GT3 BRW, all ULVs are 15W rated.
DirectX CS Support 11.1 11.1 11.1 11.2 / 12.1 11.2 / 12.1 11.2 / 12.1 SKL/KBL enjoy v11.2 and 12.1 support.
OpenGL CS Support 4.3 4.3 4.3 4.4 4.4 4.4 SKL/KBL provide v4.4 vs. version 4.3 for older devices.
OpenCL CS Support 1.2 1.2 1.2 2.0 2.0 2.1 SKL provides v2 support with KBL 2.1 vs 1.2 for older devices.
FP16 / FP64 Support No / No No / No No / No Yes / Yes Yes / Yes Yes / Yes SKL/KBL support both FP64 and FP16.
Byte / Integer Width 8 / 32-bit 8 / 32-bit 8 / 32-bit 128 / 128-bit 128 / 128-bit 128 / 128-bit SKL/KBL prefer vectorised integer workloads, 128-bit wide.
Float/ Double Width 32 / X-bit 32 / X-bit 32 / X-bit 32 / 64-bit 32 / 64-bit 32 / 64-bit Strangely neither arch prefers vectorised floating-point loads – driver bug?
Threads per CU 512 512 256 256 256 256 Strangely BRW and later reduced the threads/CU to 256.
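
As a rough yardstick for the specifications above, a theoretical FP32 peak can be estimated from the shader count and the turbo clock. The sketch assumes each shader (SP) can retire one fused multiply-add, counted as 2 FLOPs, per clock at the maximum turbo clock; that assumption and the helper names are ours, and real workloads will land well below these figures.

```python
# (shaders, max turbo clock in MHz) copied from the specification table.
gpus = {
    "HD4000 (HSW GT2U)": (160, 1000),
    "HD5000 (HSW GT3U)": (320, 1100),
    "HD6100 (BRW GT3U)": (384, 1100),
    "HD520 (SKL GT2U)": (192, 1000),
    "HD540 (SKL GT3eU)": (384, 950),
    "HD620 (KBL GT2U)": (184, 1000),
}

FLOPS_PER_SHADER_PER_CLOCK = 2  # assumption: one fused multiply-add per clock

def peak_gflops(shaders: int, mhz: int) -> float:
    """Theoretical FP32 peak in GFLOPS under the assumption above."""
    return shaders * FLOPS_PER_SHADER_PER_CLOCK * mhz / 1000.0

for name, (shaders, mhz) in gpus.items():
    print(f"{name:20s} {peak_gflops(shaders, mhz):7.1f} GFLOPS (theoretical)")
```

Comparing such an estimate with a measured score gives a sense of how much of the nominal throughput a given benchmark actually reaches on each part.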

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 288 399 597 875 [+3x] 1500 840 [+2.8x] If FP16 is enough, KBL and SKL have 2x performance of FP32.
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 299 375 614 468 [+56%] 817 452 [+50%] SKL GT3e rules the roost but KBL hardly improves on SKL.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 18.54 (eml) 24.4 (eml) 38.9 (eml) 112 [+6x] 193 104 [+5.6x] SKL GT2 with native FP64 is almost 3x emulated BRW GT3!
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.8 (eml) 2.36 (eml) 4.4 (eml) 6.34 (eml) [+3.5x] 10.92 (eml) 6.1 (eml) [+3.4x] Emulating FP128 through FP64 is ~2.5x faster than through FP32.
As expected native FP16 runs about 2x faster than FP32 and thus provides a huge performance upgrade if precision is sufficient. Native FP64 is about 8x emulated FP64 and even emulated FP128 improves by about 2.5x! Otherwise KBL GT2 matches SKL GT2 and is about 50% faster than HSW GT2 in FP32 and 6x faster in FP64.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 1.37 1.85 2.7 2.19 [+60%] 3.36  2.21 [+60%] Since BRW integer performance is similar.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1.87 2.45 3.45 2.79 [+50%] 4.3 2.83 [+50%] Not a lot changes here.
SKL/KBL GT2 with integer workloads (with extensive memory accesses) are 50-60% faster than HSW, similar to what we saw with floating-point performance. But the change happened with BRW, which improved the most over HSW, with SKL and KBL not improving further.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s)  1.2 1.62 4.35  3 [+2.5x] 5.12 2.92 In this tough compute test SKL/KBL are 2.5x faster.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2.86  3.93  9.82  6.7 [+2.34x]  11.26  6.49 With a lighter algorithm SKL/KBL are still ~2.4x faster.
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s)  0.828  1.08 1.68 1.08 [+30%] 1.85  1 64-bit integer performance does not improve much.
In pure integer compute tests SKL/KBL greatly improve over HSW, being around 2.3-2.5x faster – a huge improvement; but 64-bit integer performance hardly improves (30% faster with 20% more CUs, 24 vs 20). Again BRW is where the improvements were added, with SKL GT3e hardly improving over BRW GT3.
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 461 495 493 656 [+42%]  772 618 [+40%] Pure FP32 compute SKL/KBL are 40% faster.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) 137  238 135 SKL GT3 is 73% faster than GT2 variants.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 62.45 85.76 123 86.32 [+38%]  145.6 82.8 [+35%] In this tough algorithm SKL/KBL are still almost 40% faster.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) 18.65 31.46 19 SKL GT3 is over 65% faster than GT2 KBL.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 106 160.4 192 174 [+64%] 295 166.4 [+56%] M/C is not as tough so here SKL/KBL are 60% faster.
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) 31.61 56 31 GT3 SKL manages an 80% improvement over GT2.
Intel is pulling our leg here; KBL GPU seems to show no improvement whatsoever over SKL, but both are about 40% faster in FP32 than the much older HSW. GT3 SKL variant shows good gains of 65-80% over the common GT2 and thus is the one to get if available. Obviously the ace card for SKL and KBL is FP64 support.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS)  117  130 142 116 [=]  181 113 [=] SKL/KBL have a problem with this algorithm but GT3 does better?
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) 34.9 64.7 34.7 GT3 SKL manages an 86% improvement over GT2.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 13.3 13.1 15 20.53 [+54%]  27.3 21.9 [+64%] In a return to form SKL/KBL are 50% faster.
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) 5.2  4.19  4.69 GT3 stumbles a bit here; some optimisations are needed.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS)  122  157.9 249 201 [+64%]  304 177.6 [+45%] Here SKL/KBL are 50% faster overall.
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) 19.25 31.9 17.8 GT3 manages only a 65% improvement here.
Again we see no delta between SKL and KBL – the graphics cores perform the same; again both benefit from FP64 support allowing high precision kernels to run. GT3 SKL variant greatly improves over common GT2 variant – except in one test (DFFT) that seems to be an outlier.
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  341  432  636 492 [+44%]  641 488 [+43%] We see the GT3s trading blows in this integer test, but SKL/KBL are 40% faster than HSW.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  72.7  92.8  147  106 [+45%]  139  106 [+45%] BRW GT3 just wins this with SKL/KBL again 45% faster.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  75.6  96  152  110 [+45%]  149  111 [+45%] Another win for BRW and a 45% improvement for SKL/KBL.
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  72.6  90.6  147  105 [+44%]  143  105 [+44%] As above in this test.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s)  2.38  1.53  6.51  5.2 [+2.2x]  7.73  5.32 [+2.23x] SKL’s GT3 manages a win but overall SKL/KBL are over 2x faster than HSW.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  1.17  0.719  5.83  4.57 [+3.9x]  4.58  4.5 [+3.84x] Another win for BRW.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  511  688  1150  1100 [+2.1x]  1750  1080 [+2.05x] SKL/KBL are over 2x faster than HSW. BRW is beaten here.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s)  378.5  288  424  437 [+15%]  611  443 [+17%] Some wild results here, some optimizations may be needed.
In these integer workloads (with texture access) the 28W GT3 of BRW manages a few wins over the 15W GT3e of SKL – but compared to the old HSW both SKL and KBL are between 40% and 300% faster. Again we see no delta between SKL and KBL – there does not seem to be any difference at all!

If you have a HSW GT2 then an upgrade to SKL GT2 brings massive improvements as well as native FP16 and FP64 support. But the HSW GT3 variant is competitive and BRW GT3 even more so. KBL GT2 shows no improvement over SKL GT2 – so it’s not just the CPU core that is unchanged but the graphics core also: this is no EV9.5, more like EV9.1!

For integer workloads BRW is where the big improvement came, but for 64-bit integer that improvement is still to come, if ever. At least all drivers support native int64.
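
The "30% faster with 20% more CUs" remark about 64-bit integer hashing can be made precise by dividing each score by its CU count before comparing. The sketch below does this for the SHA2-512 row; the helper name is ours, the figures come from the tables above, and clocks are ignored because both parts share the same 1000MHz turbo.

```python
def per_cu_gain(old_score, old_cus, new_score, new_cus):
    """Speed-up of the new GPU after dividing each score by its CU count."""
    return (new_score / new_cus) / (old_score / old_cus)

# SHA2-512 (int64) score and CU count from the tables above.
hsw = (0.828, 20)  # HD4000, HSW GT2U
skl = (1.08, 24)   # HD520, SKL GT2U

raw = skl[0] / hsw[0]
normalised = per_cu_gain(hsw[0], hsw[1], skl[0], skl[1])
print(f"raw {raw:.2f}x, per CU {normalised:.2f}x")
```

Once the extra CUs are factored out, the per-CU gain for this test is under 10%, which is why the 64-bit integer improvement is described as still to come.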

Transcoding Performance

We are testing media (video + audio) transcoding performance for common video algorithms: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
H.264/AVC Decoder/Encoder QuickSync H264 8-bit only QuickSync H264 8-bit only QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit HSW supports 8-bit only so 10-bit (high-colour) are out of luck.
H.265/HEVC Decoder/Encoder QuickSync H265 8-bit partial QuickSync H265 8-bit QuickSync H265 8-bit QuickSync H265 8/10-bit SKL has full/hardware H265/HEVC transcoding but for 8-bit only; Main10 (10-bit profile) requires KBL so finally we see a difference.
Transcode Benchmark VC 1 > H264/AVC Transcoding (MB/s)  7.55 8.4  7.42 [-2%]  8.25  8.08 [+6%] With DDR4 KBL is 6% faster.
Transcode Benchmark VC 1 > H265/HEVC Transcoding (MB/s)  0.734  3.14 [+4.2x]  3.67  3.63 [+5x] Hardware support makes SKL/KBL 4-5x faster.

If you want HEVC/H.265, including 4K/UHD, then you want SKL. But if you plan on using 10-bit/HDR colour then you need KBL – finally an improvement over SKL. As transcoding uses fixed-function hardware, the GT3 performs only slightly faster.

Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX/OpenGL ComputeShader, including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Memory Configuration 8GB DDR3 1.6GHz 128-bit 8GB DDR3 1.6GHz 128-bit 16GB DDR3 1.6GHz 128-bit 8GB DDR3 1.867GHz 128-bit 16GB DDR4 2.133GHz 128-bit 16GB DDR4 2.133GHz 128-bit All use 128-bit memory with SKL/KBL using DDR4.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 64 2048 / 64 2048 / 64 2048 / 64 Shared memory remains the same; in SKL/KBL constant memory is the same as global.
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 10.4 10.7 11 15.65 23 [+2.1x] 19.6 DDR4 seems to provide over 2x bandwidth despite low clock.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 5.23 5.35 5.54 7.74 11.23 [+2.1x] 9.46 Again over 2x increase in up speed.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 5.27 5.36 5.29 7.42 11.31 [+2.1x] 9.6 Again over 2x increase in down speed.
SKL/KBL + DDR4 provide over 2x increase in internal, up and down memory bandwidth – despite the relatively modest increase in memory speed (2133 vs 1600); with DDR3 1867MHz memory the improvement drops to 1.5x. So if you were to decide between DDR3 and DDR4, the choice has been made!
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns)  179 192  234 [+30%]  296 235 [+30%] With DDR4 latency has increased by 30% – not great.
GPGPU Memory Latency Constant Memory Latency (ns)  92.5  112  234 [+2.53x]  279  235 [+2.53x] Constant memory has effectively been dropped resulting in a disastrous 2.53x higher latencies.
GPGPU Memory Latency Shared Memory Latency (ns)  80  84  –  86.8 [+8%]  102  84.6 [+8%] Shared memory latency has stayed the same.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns)  283  298  56 [1/5x]  58.1 [1/5x] Texture access seems to have markedly improved to be 5x faster.
SKL/KBL global memory latencies have increased by 30% with DDR4 – thus wiping out some gains. The “new” constant memory (2GB!) is now really just bog-standard global memory and thus comes with an over 2x increase in latency. Shared memory latency has stayed pretty much the same. Texture memory access is very much faster – 5x faster, likely through some driver optimisations.

Again no delta between KBL and SKL; if you want bandwidth (who doesn’t?) DDR4 with modest 2133MHz memory doubles bandwidths – but latencies increase. Constant memory is now the same as global memory and does not seem any faster.
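
To put the measured bandwidths in context, the theoretical peak of each memory configuration is simply transfers per second times bytes per transfer. The sketch below assumes the clock figures in the memory configuration row are effective transfer rates in MT/s and that GB means 10^9 bytes; the helper name is ours.

```python
def peak_bandwidth_gbs(transfers_mts: float, bus_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    return transfers_mts * 1e6 * (bus_bits / 8) / 1e9

# (assumed MT/s, measured internal bandwidth in GB/s) from the table above.
configs = {
    "HD4000 DDR3-1600": (1600, 10.4),
    "HD520 DDR3-1867": (1867, 15.65),
    "HD540 DDR4-2133": (2133, 23.0),
    "HD620 DDR4-2133": (2133, 19.6),
}

for name, (mts, measured) in configs.items():
    peak = peak_bandwidth_gbs(mts, 128)
    print(f"{name}: peak {peak:.1f} GB/s, measured {measured} GB/s ({measured / peak:.0%})")
```

The theoretical peak rises by only about a third from 1600 to 2133, so most of the measured 2x gain must come from the newer parts using a larger share of the available bandwidth rather than from the memory speed alone.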

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL as well as memory bandwidth performance.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 250  119 602 [+2.4x] 1000 537 [+2.1x] Fp16 support in DirectX doubles performance.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 235  109 338 [+43%]  496 289 [+23%] Fp16 does not yet work in OpenGL.
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s)  238  120 276 [+16%]  485 248 [+4%] We only see a measly 4-16% better performance here.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 228  108 338 [+48%] 498 289 [+26%] SKL does better here – it’s 50% faster than HSW.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 52.4  78 76.7 [+46%] 133 69 [+30%] With FP64 SKL is still 45% faster.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 63.2  67.2 105 [+60%] 177 96 [+50%] Similar result here 50-60% faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 5.2  7 18.2 [+3.5x] 31.3 16.7 [+3.2x] Driver optimisation makes SKL/KBL over 3.5x faster.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 5.55  7.5 57.5 [+10x]  97.7 52.3 [+9.4x] Here we see SKL/KBL over 10x faster!
We see similar results to OpenCL GPGPU here – with FP16 doubling performance in DirectX – but with FP64 already supported in both DirectX and OpenGL even with HSW, KBL and SKL have less of a lead – of around 50%.
Video Memory Benchmark Internal Memory Bandwidth (GB/s)  15  14.8 27.6 [+84%] 26.9 25 [+67%] DDR4 brings almost 50% more bandwidth.
Video Memory Benchmark Upload Bandwidth (GB/s)  7  7.8 10.1 [+44%] 12.34 10.54 [+50%] Upload bandwidth has also increased ~50%.
Video Memory Benchmark Download Bandwidth (GB/s)  3.63  3.3 3.53 [-2%] 5.66 3.51 [-3%] No change in download bandwidth though.

Final Thoughts / Conclusions

SKL and KBL with the 21.45 driver yield significant gains in OpenCL, making an upgrade from HSW and even BRW quite compelling despite the relatively modern 20.40 driver Intel was forced to provide for Windows 10. The GT3 version provides good gains over the standard GT2 version and should always be selected if available.

Native FP64 support is a huge addition which provides support for high-precision kernels – unheard of for integrated graphics. Native FP16 support provides an additional 2x performance in cases where 16-bit floating-point processing is sufficient.

However, KBL’s EV9.5 graphics core shows no improvement at all over SKL’s EV9 core – thus it’s not just the CPU core that has not been changed but the GPU core too! The one exception is the updated transcoder supporting Main10 HEVC/H.265 (thus HDR / 10-bit+ colour), which is still quite useful for UHD/4K HDR media.

This is very much a surprise: while the CPU core has not improved markedly since SNB (Core v2), the GPU core has always provided significant improvements – and now we have hit the same road-block, even as dedicated GPUs have continued to improve significantly in performance and power efficiency. This marks the smallest generation-to-generation improvement ever – SKL to KBL – and effectively KBL is a SKL refresh.

It seems the rumour that Intel may change to ATI/AMD graphics cores may not be such a crazy idea after all!

AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…


What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kabini” (KB) SoC, which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN).

In this article we test (GP)GPU graphics unit performance.


Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

Graphics Unit CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comment
Graphics Core B-GT EV8 B-GT2Y EV8 GCN GCN There is no change in GPU core in Mullins; it appears to be a re-brand from 83XX to R4. But time does not stand still, so while Kabini went against BayTrail’s “crippled” EV7 (IvyBridge) GPU – Mullins must battle the “beefed-up” brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough…
APU / Processor Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) The series has changed but not much else, not even the CPU core.
Cores (CU) / Shaders (SP) / Type 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) 2C / 128SP 2C / 128SP [=] We still have 2 GCN Compute Units but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly?
Speed (Min / Max / Turbo) MHz 200 – 600 200 – 800 266 – 496 266 – 500 [=] Nope, clock has not changed either in Mullins.
Power (TDP) W 2.4 (under 4) 4.5 15 15 [=] As before, Intel’s designs have a crushing advantage over AMD’s: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than new Atom. Powered-down versions (6W?) would still consume more while performing worse.
DirectX / OpenGL / OpenCL Support 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 11.2 (12?) / 4.5 / 1.2 11.2 (12?) / 4.5 / 2.0 GCN supports DirectX 11.2 (not a big deal) and also OpenGL 4.5 (vs 4.3 on Intel but including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10’s DirectX 12. So while AMD supports newer versions of standards there’s not much in it.
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) No / Yes No / Yes Sadly even AMD does not support FP16 (why?) but does support FP64 (double-float) in all interfaces – while Atom/Core GPU only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs but it’s good to know it’s there…
Threads per CU 256 (256x256x256) 512 (512x512x512) 256 (256x256x256) 256 (256x256x256) GCN has traditionally not supported a large number of threads-per-CU (256) and it is no different here, with Intel’s GPU now supporting twice as many (512) – but whether this will make a difference remains to be seen.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: GPGPU Vectorised
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 160.5 181.7 165.3 163.9 [-1%] Straight off the bat, we see no change in score in Mullins; however, Atom has caught up – scoring within a whisker and Core M faster still (+13%). Not what AMD is used to seeing for sure.
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 160 180 165 163.9 [-1%] As FP16 is not supported by any of the GPUs, unsurprisingly the results don’t change.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 10.1 (emulated) 11.6 (emulated) 13.4 14 [+4%] We see a tiny 4% improvement in Mullins but due to native FP64 support it is almost 40% faster than both Intel GPUs.
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.08 (emulated) 1.32 (emulated) 0.731 (emulated) 0.763 [+4%] (emulated) No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite its FP64 support both Intel GPUs are much faster. Sometimes the FP64/FP32 ratio is so low that it’s not worth using FP64 and emulation can be faster (e.g. nVidia).
AMD Mullins: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 825 770 998 1024 [+3%] In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power being 25% faster than Intel’s GPUs – TDP notwithstanding.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1106 ? 1280 1423 [+11%] With fewer rounds, Mullins is now 11% faster – finally a good improvement and again 28% faster than Atom’s GPU.
AMD Mullins: GPGPU Hash
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 309 ? 59 282 [+4.7x] This 64-bit integer compute-heavy workload seems to have triggered a driver bug in Kabini since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than native? Surprisingly Atom’s EV8 is faster (+9%) – not something we’d expect to see.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 1187 1331 1618 1638 [+1%] In this integer compute-heavy workload, Mullins is just 1% faster (within margin of error) – which again proves GPU has not changed at all vs. older Kabini. At least it’s faster than both Intel GPUs, 38% faster than Atom’s.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2764 ? 2611 3256 [+24%] SHA1 is less compute-heavy but here we see a 24% Mullins improvement, again likely a driver “fix”. This allows it to beat Atom’s GPU by 17% – showing that driver optimisations can make a big difference.
AMD Mullins: GPGPU Financial
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 299.8 280.5 248.3 326.7 [+31%] Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini, which is just as well as both Intel GPUs are coming on strong – it’s 9% faster than Atom’s GPU. One thing’s for sure, Intel’s EV8 GPU is no slouch.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) n/a (no FP64) n/a (no FP64) 21 21.2 [+1%] AMD’s GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within margin of error), unable to replicate the FP32 improvement we saw.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 28 36.5 32.3 30.9 [-4%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%) – likely due to driver differences. Both Intel GPUs are coming strong, with Core M’s GPU 20% faster. Considering how fast the GCN shared memory is – we expected better.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 1.85 1.87 [+1%] Switching to FP64 on AMD’s GPUs, Mullins is now 1% faster (within margin of error). Luckily Intel’s GPUs do not support FP64.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 61.9 54.9 32.9 46.3 [+40%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again a driver change) – but surprisingly cannot match either Intel GPU, with Atom’s GPU 32% faster! Again, we see just how much Intel has improved the GPU in Atom – perhaps too much!
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 5.39 5.59 [+3%] Switching to FP64 we now see a little 3% improvement for Mullins, but better than the 1% we saw before…
AMD Mullins: GPGPU Scientific
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 45 44.1 43.5 41.5 [-5%] GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – again this allows Intel’s GPUs to win, with Atom’s GPU just 8% faster – but a win is a win. Mullins’s GPU is starting to look underpowered considering the much higher TDP.
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.11 3.73 [-9%] Switching to FP64, Mullins now manages to be 9% slower than Kabini – thankfully Intel’s GPUs don’t support FP64…
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 9 8.94 7.89 9.5 [+20%] FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again, just as well as Intel’s GPUs are hot on its tail – and it is just 5% faster than Atom’s GPU!
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 2.2 3 [+36%] Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 65 50 58 63 [+9%] In our last test we see Mullins is 9% faster – but not enough to beat Atom’s GPU, which is just 3% faster (65 vs. 63) but faster still. Did anybody expect that?
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.75 4.74 [=] Switching to FP64, Mullins scores exactly the same as Kabini.

Firstly, Mullins’s GPU is unchanged from Kabini; due to driver optimisations/fixes (as well as kernel optimisations) Mullins is sometimes faster, but that is not due to any hardware changes. If you were expecting more, you will be disappointed.
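The bracketed figures in the tables above are relative changes of Mullins over Kabini. As a minimal sketch of that arithmetic (the helper name `delta_percent` is ours and not part of the benchmark), using three FP32 rows copied from the tables:

```python
def delta_percent(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

# (Kabini, Mullins) scores copied from the tables above.
rows = {
    "Black-Scholes FP32 (MOPT/s)": (248.3, 326.7),
    "Binomial FP32 (kOPT/s)": (32.3, 30.9),
    "SGEMM FP32 (GFLOPS)": (43.5, 41.5),
}

deltas = {
    name: delta_percent(mullins, kabini)
    for name, (kabini, mullins) in rows.items()
}

for name, change in deltas.items():
    print(f"{name}: {change:+.1f}%")
```

The same helper applies to any pair of columns, e.g. Mullins against Atom, as long as higher scores mean better performance.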

Intel’s EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) can now keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD’s APUs used to have is long gone. Considering the TDP differences (4-5x higher), Mullins’s GPU looks underpowered – the number of cores should at least have been doubled to maintain its advantage.

Unless Atom (CherryTrail) is more expensive – there’s really no reason to choose Mullins; the power advantage of Atom is hard to deny. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel’s 15W ULV designs are far more powerful.
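For readers unfamiliar with the finance workloads, the difference between Black-Scholes (a closed-form formula per option) and Binomial (a tree that is reduced step by step) can be sketched on the CPU as below. This is only an illustration under our own assumptions: the function names, the Cox-Ross-Rubinstein parameterisation and the sample option are ours, and the benchmark's actual OpenCL kernels are not shown in the article.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s, k, r, sigma, t):
    """Closed-form European call price: a few independent maths ops per option."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def binomial_call(s, k, r, sigma, t, steps=512):
    """Cox-Ross-Rubinstein tree: each backward step combines neighbouring nodes."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))   # up factor
    d = 1.0 / u                           # down factor
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Option payoffs at expiry, one per leaf of the tree.
    values = [max(s * u**j * d**(steps - j) - k, 0.0) for j in range(steps + 1)]
    # Backward reduction: level n is computed from level n + 1.
    for n in range(steps, 0, -1):
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j]) for j in range(n)]
    return values[0]

# Hypothetical sample option: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
bs = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
tree = binomial_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"Black-Scholes: {bs:.4f}  Binomial (512 steps): {tree:.4f}")
```

The backward loop is where a GPU implementation would typically keep the current tree level in shared memory, which is consistent with the article's remark that Binomial involves many shared-memory operations.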

Transcoding Performance

We are testing transcoding performance of GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) AMD h264Encoder (hardware accelerated) AMD h264Encoder (hardware accelerated) Both are using their own hardware-accelerated transcoders for H264.
AMD Mullins: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 5 ? 2 2.14 [+7%] We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 4.75 ? 2 2.07 [+3.5%] When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast.

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast at transcoding H.264! Poor Mullins/Kabini are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
Memory Configuration 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) Except Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 32 64 / 32 Surprisingly AMD’s GCN has 1/2 the shared memory of Intel’s EV8 (32 vs. 64) but considering the low number of threads-per-CU (256) only kernels making very heavy use of shared memory would be affected; still, more is better than less.
L1 / L2 / L3 Caches (kB) 256kB? L2 256kB? L2 16kB? L1 / 256kB? L2 16kB? L1 / 256kB? L2 Cache sizes are always pretty “hush hush” but since the core has not changed, we would expect the same cache sizes – with GCN also sporting an L1 data cache.
AMD Mullins: GPGPU Memory BW
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 11 10.1 8.8 5.5 [-38%] OpenCL memory performance has surprisingly taken a big hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 2.09 3.91 4.1 2.88 [-30%] Upload bandwidth is similarly affected, we measure a 30% decrease.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 2.29 3.79 3.18 2.9 [-9%] Download bandwidth is the least affected, just 9% lower.
AMD Mullins: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 829 274 1973 693 [-1/3x] Even though the core is unchanged, latency is 1/3 of Kabini. Since we don’t see a comparative increase in performance, this again points to a driver issue.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1669 ? ? 817 Going out-of-page does not increase latency much.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 279 ? ? 377 Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job.
GPGPU Memory Latency Constant Memory Latency (ns) 209 ? 629 401 [-33%] The L1 (16kB) cache does not cover the whole constant memory (64kB) – so latency (401ns) is not dramatically lower than global memory (693ns). There is little advantage to using constant memory.
GPGPU Memory Latency Shared Memory Latency (ns) 201 ? 20 16 [-20%] Shared memory is a little bit faster (20% lower). We see that shared memory latency is much lower than constant/global latency (16 vs. 401) – denoting dedicated shared memory. On Intel’s EV8 GPU there is no change (201 vs. 209) – which would indicate global memory used as shared memory.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1234 ? 2369 691 [-70%] We see a massive latency reduction – again likely a driver optimisation/fix.
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 353 ? ? 353 Sequential access brings the latency down to a quarter (1/4x) – showing the power of the prefetchers.

The optimisations in newer drivers make a big difference – though the same could apply to the previous gen (Kabini). The dedicated shared memory – compared to Intel’s GPUs – likely helps GCN achieve its performance.
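The in-page random, full random and sequential rows above differ only in the order in which memory is visited. The article does not describe how the benchmark generates these patterns; a common technique is pointer chasing, where each element stores the index of the next element to read. The sketch below is an assumption-laden illustration of that idea (all function names and the element/page sizes are hypothetical), not the benchmark's implementation.

```python
import random

def sequential_chain(n):
    """Each element points to its neighbour, wrapping at the end."""
    return [(i + 1) % n for i in range(n)]

def full_random_chain(n, rng):
    """One random cycle over the whole buffer (Sattolo's algorithm)."""
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i keeps the permutation a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def in_page_random_chain(n, page, rng):
    """Random order inside each page; pages are visited one after another."""
    order = []
    for start in range(0, n, page):
        idx = list(range(start, min(start + page, n)))
        rng.shuffle(idx)
        order.extend(idx)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def walk(nxt, start=0):
    """Follow the chain until it returns to `start`; returns the visit order."""
    seen, i = [], start
    while True:
        seen.append(i)
        i = nxt[i]
        if i == start:
            return seen

rng = random.Random(2015)  # fixed seed so the sketch is repeatable
n, page = 64, 16           # hypothetical sizes, in elements
chains = {
    "sequential": sequential_chain(n),
    "in-page random": in_page_random_chain(n, page, rng),
    "full random": full_random_chain(n, rng),
}
page_changes = {}
for name, nxt in chains.items():
    visited = walk(nxt)
    page_changes[name] = sum(
        1 for a, b in zip(visited, visited[1:]) if a // page != b // page
    )
    print(f"{name}: {len(visited)} elements visited, {page_changes[name]} page changes")
```

Because each load depends on the previous one, the time per step of such a walk approximates latency rather than bandwidth; the number of page changes shows why the full random pattern is the hardest on the TLB.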

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Shader
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 127.6 ? 128.8 129.6 [=] Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 121.8 172 124 124.4 [=] OpenGL does not change matters, Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change.
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 109.5 ? 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 121.8 170 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change either.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 18 ? 8.9 8.91 [=] Unlike the OpenCL driver, Intel’s DirectX driver does support FP64 – which allows Atom’s GPU to be at least 2x (twice) as fast as Mullins/Kabini.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 26 46 12 12 [=] As above, Intel OpenGL driver does support FP64 also – so all GPUs run native FP64 code again. This allows Atom’s GPU to be over 2x faster than Mullins/Kabini again – while Core M’s GPU is almost 4x (four times!) faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.34 (emulated) ? 1.6 (emulated) 1.66 (emulated) [+3%] Here we’re emulating (mantissa extending) FP128 using FP64: EV8 stumbles a bit allowing Mullins/Kabini to be a little bit faster despite what we saw in FP64 test.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1.1 (emulated) 3.4 (emulated) 0.738 (emulated) 0.738 (emulated) [=] OpenGL does change the results a bit, Atom’s GPU is now faster (+50%) while Core M’s GPU is far faster (+5x). Heavy shaders seem to take their toll on GCN’s GPU.

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor the OpenGL driver seems to make a difference. But what is different is that Intel’s GPUs support FP64 natively in both DirectX and OpenGL – making them much faster (2-4x in the FP64 tests above) than AMD’s GCN. If the OpenCL driver were to support it – AMD would be in trouble!

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 11.18 12.46 8 9.7 [+21%] DirectX bandwidth does not seem to be affected by the OpenCL “bug”; here we see Mullins having 21% more bandwidth than Kabini using the very same memory. Perhaps the memory controller has seen some improvements after all.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.83 5.29 3 3.61 [+20%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again Mullins does well with 20% more bandwidth.
Video Memory Benchmark Download Bandwidth (GB/s) 2.1 1.23 3 3.34 [+11%] Download bandwidth improves by 11% only, but better than nothing.

Unlike OpenCL, we see DirectX bandwidth increased by 11-21% – while using the same memory. Hopefully AMD will “fix” the OpenCL issue which should help kernel performance no end.

Final Thoughts / Conclusions

Mullins’s GPU is unchanged from its predecessor (Kabini) but a few driver optimisations/fixes allow it to score better in many tests by a small margin – however these would also apply to the older devices. There isn’t really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins’s TDP is 15W vs. Atom at 2.6-4W or Core M at 4.6W – it’s really not looking good for AMD: its CPU performance is unlikely to be much better than Atom’s (we shall see in CPU AMD A4 “Mullins” performance) – and at 3-5x (three to five times) more power it is woefully power inefficient.
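To make the efficiency argument concrete, here is a rough performance-per-watt calculation using the Black-Scholes FP32 OpenCL scores from the tables and the TDP figures quoted above. It is a sketch under stated assumptions: TDP covers the whole SoC rather than the GPU alone, actual power draw during the test was not measured, and we take the upper end (4W) of the Atom range.

```python
# score: Black-Scholes FP32 OpenCL (MOPT/s); tdp_w: TDP as quoted in the text above.
devices = {
    "Mullins": {"score": 326.7, "tdp_w": 15.0},
    "Atom (CherryTrail)": {"score": 299.8, "tdp_w": 4.0},  # upper end of 2.6-4W
    "Core M (Broadwell)": {"score": 280.5, "tdp_w": 4.6},
}

per_watt = {name: d["score"] / d["tdp_w"] for name, d in devices.items()}

for name, value in per_watt.items():
    ratio = value / per_watt["Mullins"]
    print(f"{name}: {value:.1f} MOPT/s per W ({ratio:.2f}x Mullins)")
```

Even with the least favourable TDP for Atom, Mullins's small lead in raw score turns into a deficit of roughly three times once power is taken into account.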

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.

Intel Atom X7 (CherryTrail 2015) GPGPU: Closing on Core M?

What is CherryTrail (Braswell)?

“CherryTrail” (CYT) is the next generation Atom “APU” SoC from Intel (v3 2015) replacing the current Z3000 “BayTrail” (BYT) SoC which was Intel’s major foray into tablets (both Windows & Android). The “desktop” APUs are known as “Braswell” (BRS) while the APUs for other platforms have different code names.

BayTrail was a major update to both CPU (out-of-order core, SSE4.x, AES HWA, Turbo/dynamic overclocking) and GPU (EV7 IvyBridge GPGPU core), so CherryTrail is a minor process shrink – but with a very much updated GPGPU, now EV8 (as in the latest Core Broadwell).

In this article we test (GP)GPU graphics unit performance.

Hardware Specifications

We are comparing the internal GPUs of 3 processors (BayTrail, CherryTrail and Broadwell-Y) that support GPGPU.

Graphics Unit BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comment
Graphics Core B-GT EV7 B-GT EV8? B-GT2Y EV8 CherryTrail’s GPU is meant to be based on EV8 like Broadwell – the very latest GPGPU core from Intel! This makes it more advanced than the very popular Core Haswell series, a first for Atom.
APU / Processor Atom Z3770 Atom X7 Z8700 Core M 5Y10 Core M is the new Core-Y UULV versions for high-end tablets against the Atom processors for “normal” tablets/phones.
Cores (CU) / Shaders (SP) / Type 4C / 32SP (2×4 SIMD) 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) Here’s the major change: CherryTrail has no less than 4 times the compute units (CU) in the same power envelope of the old BayTrail. Broadwell has more (24) but it is also rated at higher TDP.
Speed (Min / Max / Turbo) MHz 333 – 667 200 – 600 200 – 800 CherryTrail goes down all the way to 200MHz (same as Broadwell) which should help power savings. Its top speed is a bit lower than BayTrail but not by much.
Power (TDP) W 2.4 (under 4) 2.4 (under 4) 4.5 Both Atoms have the same TDP of around 2-2.4W – while Broadwell-Y is rated at 2x at 4.5-6W. We shall see whether this makes a difference.
DirectX / OpenGL / OpenCL Support 11 / 4.0 / 1.1 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 Intel has continued to improve the video driver – 2 generations share a driver – but here CherryTrail has a brand-new driver that supports much newer technologies like DirectX 11.1 (vs 11.0), OpenGL 4.3 (vs 4.0) including Compute and OpenCL 1.2. Broadwell’s driver does support OpenCL 2.0 – perhaps a later CherryTrail driver will do too?
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX) No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) Sadly FP16 support is still missing and FP64 is also missing on OpenCL – but available in DirectX Compute as well as OpenGL Compute! Those Intel FP64 extensions are taking their time to appear…

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: GPGPU Vectorised
GPGPU Arithmetic Benchmark Single/Float/FP32 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 181 [+13%] Straight off the bat we see that 4x more (and more advanced) CUs in CherryTrail give us 6.4x better performance – a huge improvement! Even the brand-new Broadwell GPU is only 13% faster.
GPGPU Arithmetic Benchmark Half/Float/FP16 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 180 [+13%] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
GPGPU Arithmetic Benchmark Double/FP64 Vectorised OpenCL (Mpix/s) 1.63 (emulated) 10.1 [+6.2x] (emulated) 11.6 [+15%] (emulated) None of the GPUs support native FP64 either: emulating FP64 (mantissa extending) is quite hard on all GPUs, but the results don’t change: CherryTrail is 6.2x faster with Broadwell just 15% faster.
GPGPU Arithmetic Benchmark Quad/FP128 Vectorised OpenCL (Mpix/s) 0.18 (emulated) 1.08 [+6x] (emulated) 1.32 [+22%] (emulated) Emulating FP128 using FP32 is even more complex but CherryTrail does not disappoint, it is still 6x faster; Broadwell does pull ahead a bit being 22% faster.
Intel Braswell: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 96 825 [+8.6x] 770 [-7%] In this tough integer workload that uses shared memory CherryTrail does even better – it is 8.6x faster, more than we’d expect – the newer driver may help. Surprisingly this is faster than even Broadwell.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 129 1105 [+8.6x] n/a What we saw before is no fluke, CherryTrail’s GPU is still 8.6 times faster than BayTrail’s.
Intel Braswell: GPGPU Hashing
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 54 309 [+5.7x] This 64-bit integer compute-heavy workload is hard on all GPUs (no native 64-bit arithmetic), but CherryTrail does well – it is almost 6x faster than the older BayTrail. Note that neither DirectX nor OpenGL natively support int64 so this is about as hard as it gets for our GPUs.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 96 1187 [+12.4x] 1331 [+12%] In this integer compute-heavy workload, CherryTrail really shines – it is 12.4x (twelve times) faster than BayTrail! Again, even the latest Broadwell is just 12% faster than it! Atom finally kicks ass both in CPU and GPU performance.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 215 2764 [+12.8x] SHA1 is less compute-heavy, but results don’t change: CherryTrail is 12.8x faster – the best result we’ve seen so far.
Intel Braswell: GPGPU Finance
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 34.33 299.8 [+8.7x] 280.5 [-7%] Starting with the financial tests, CherryTrail is quick off the mark – being almost 9x (nine times) faster than BayTrail – and again somehow faster than even Broadwell. Who says Atom cannot hold its own now?
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 5.16 28 [+5.4x] 36.5 [+30%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – but CherryTrail still holds its own, it’s over 5x faster – not as much as we saw before but still a massive improvement. Broadwell’s EV8 GPU does show its prowess being 30% faster still.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 3.6 61.9 [+17x] 54 [-12%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; here we see CherryTrail shine – it’s 17x (seventeen times) faster than BayTrail’s GPU – so much so we had to recheck. Most likely the newer GPU driver helps – but BayTrail will not get these improvements. Broadwell is again surprisingly 12% slower.
Intel Braswell: GPGPU Science
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 6 45 [+7.5x] 44.1 [-3%] GEMM is quite a tough algorithm for our GPUs but CherryTrail remains over 7.5x faster – even Broadwell is 3% slower than it. We saw before EV8 not performing as we expected – perhaps some more optimisations are needed.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 2.27 9 [+3.96x] 8.94 [-1%] FFT involves many kernels processing data in a pipeline – so here we see CherryTrail only 4x (four times) faster – the slowest we’ve seen so far. But then again Broadwell scores about the same so it’s a tough test for all GPUs.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 10.74 65 [+6x] 50 [-23%] In our last test we see CherryTrail going back to being 6x faster than BayTrail – surprisingly again Broadwell’s EV8 GPU is 23% slower than it.

There is no simpler way to put this: CherryTrail Atom’s GPU obliterates the old one – never being less than 4x and up to 17x (yes, seventeen!) faster, many times even overtaking the much newer, more expensive and more power hungry Broadwell (Core M) EV8 GPU! It is a no-brainer really, you want it – for once Microsoft made a good choice for Surface 3 after the disasters of earlier Surfaces (perhaps they finally learn? Nah!).

There isn’t really much to criticise: sure, FP16 native support is missing – which is a pity on Android (that uses FP16 in UX) and naturally FP64 is also missing in OpenCL – though as usual it is available in DirectX Compute and OpenGL Compute. As mentioned, since OpenGL 4.3 is supported, Compute is also supported for the first time on Atom – a feature recently introduced in newer drivers on Haswell and later GPUs (EV7.5, EV8).
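The "mantissa extending" emulation mentioned in the FP64 and FP128 rows above is commonly done by representing one value as an unevaluated sum of two lower-precision numbers (often called float-float or double-double arithmetic). The benchmark's kernels are not shown in the article, so the following is only a minimal sketch of the addition step, assuming that technique; the names `f32`, `two_sum` and `ff_add` are ours, and FP32 is simulated on the CPU by rounding through `struct`.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a: float, b: float):
    """Knuth's error-free addition: s is the rounded sum, e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ff_add(x, y):
    """Add two (hi, lo) pairs, each holding an extended-precision value hi + lo."""
    s, e = two_sum(x[0], y[0])
    e = f32(f32(e + x[1]) + y[1])
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # fast renormalisation, valid while |s| >= |e|
    return hi, lo

# Add a value too small for FP32 to register next to 1.0, ten times over.
step = f32(1e-8)
plain = f32(1.0)
extended = (f32(1.0), 0.0)
for _ in range(10):
    plain = f32(plain + step)
    extended = ff_add(extended, (step, 0.0))

print(f"plain FP32:  {plain!r}")
print(f"float-float: {extended[0] + extended[1]!r}")
```

Each emulated addition costs around a dozen native operations, which is in line with the large drop between the native FP32 and emulated FP64 scores in the table; a pair of FP32 values also carries about 48 significand bits, slightly fewer than native FP64's 53.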

Just in case we’re not clear: this *is* the Atom you are looking for!

Transcoding Performance

We are testing transcoding performance of GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) Same transcoder is used for all GPUs.
Intel Braswell: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 2.24 5 [+2.23x] 8.41 [+68%] H.264 transcoding on the new Atom has more than doubled (2.2x) which makes it ideal as a HTPC (e.g. Plex server). However, with more power we can see that Core M has over 60% more bandwidth.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 1.89 4.75 [+2.51x] 8.2 [+70%] When just using the H264 encoder we still see a 2.5x improvement (over two and a half times), with Core M again about 70% faster still.

Intel has not forgotten transcoding, with the new Atom over 2x (twice) as fast – so if you were thinking of using it as a HTPC (NUC/Brix) server, better get the new one. However, unless you really want low power – the Core M (and thus ULV) versions are 60-70% faster still…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Memory Configuration 2GB DDR3 1067MHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) Atom is generally configured to use a single memory controller, but CherryTrail runs its memory at 1.6GT/s, the same as modern Core APUs. Core M/Broadwell naturally has a dual-channel controller, though some laptops/tablets may use just one channel.
Cache Configuration 32kB L2 global/texture? 128kB L3 256kB L2 global/texture? 384kB L3 256kB L2 global/texture? 384kB L3 Internal cache arrangement seems to be very secret – so a lot of it is deduced from the latency graphs. The L2 increase in CherryTrail is in line with CU increase, i.e. 8x larger.
Intel Braswell: GPGPU Memory Bandwidth
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 3.45 11 [+3.2x] 10.1 [-8%] CherryTrail manages over 3x higher bandwidth during internal transfer over BayTrail, close to what we’d expect a dual-channel system to achieve. Surprisingly our dual-channel Core M manages to be 8% slower. We did see Broadwell achieve less than Haswell – which may explain what we’re seeing here.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 1.18 2.09 [+77%] 3.91 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Here CherryTrail improves almost 2x over BayTrail – but finally we see Core M being 87% faster still.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 1.13 2.29 [+2.02x] 3.79 [+65%] Download bandwidth has improved a bit more than upload, with CherryTrail being over 2x (twice) as fast – but again Broadwell is 65% faster still. This will really help GPGPU applications that need to copy large results from the GPU to CPU memory until the “zero copy” feature arrives.
Intel Braswell: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 981 829 [-15%] 274 With the memory running faster, we see latency decreasing by 15%, a good result. However, Broadwell does so much better, with almost 1/4 of the latency.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1272 1669 [+31%] Surprisingly, using full-random access we see latency increase by 31%. This could be due to the larger (4GB vs. 2GB) memory arrangement – the TLB-miss hit could be much higher.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 383 279 [-27%] Sequential access brings the latency down by 27% – a good result.
GPGPU Memory Latency Constant Memory Latency (ns) 660 209 [-1/3x] With L1 cache covering the entire constant memory on CherryTrail – we see latency decrease to 1/3 (a third), great for kernels that use more than 32kB constant data.
GPGPU Memory Latency Shared Memory Latency (ns) 215 201 [-6%] Shared memory is a bit faster (6% lower latency), nothing to write home about.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1583 1234 [-22%] With the memory running faster, as with global memory we see latency decreasing by 22% here – a good result!
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 916 353 [-61%] Sequential access brings the latency down by a huge 61%, an even bigger difference than what we saw with Global memory. Impressive!

Again, we see big gains in CherryTrail with bandwidth increasing by 2-3x which is necessary to keep all those new EVs fed with data; Broadwell does do better but then again it has a dual-channel memory controller.

Latency has also decreased by a good amount (6-22%), likely due to the faster memory employed, and the much larger caches (8x) do help. For data that exceeded the small BayTrail cache (32kB) – the CherryTrail one should be more than sufficient.
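To put the internal bandwidth figures above in context, they can be compared against the theoretical peak of each memory configuration (transfer rate times bus width). This is a back-of-the-envelope sketch with two assumptions of ours: the "1067MHz" and "1.6GHz" entries in the table are treated as transfer rates in MT/s, and both Atoms are assumed to run single-channel (64-bit), since the table lists them as 64/128-bit without saying which was tested.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, bus_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * 1e6 * (bus_bits / 8) / 1e9

# (transfer rate in MT/s, assumed bus width in bits, measured OpenCL internal GB/s)
systems = {
    "BayTrail": (1067, 64, 3.45),
    "CherryTrail": (1600, 64, 11.0),
    "Broadwell": (1600, 128, 10.1),
}

efficiency = {}
for name, (rate, bits, measured) in systems.items():
    peak = peak_bandwidth_gbs(rate, bits)
    efficiency[name] = measured / peak
    print(f"{name}: peak {peak:.1f} GB/s, measured {measured} GB/s "
          f"({efficiency[name]:.0%} of peak)")
```

Under these assumptions CherryTrail gets much closer to its theoretical peak than either BayTrail or Broadwell; if the CherryTrail system was in fact running dual-channel, its efficiency would be half the figure printed here.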

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Shaders
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 39 127.6 [+3.3x] Starting with DirectX FP32, CherryTrail is over 3.3x faster than BayTrail – not as high as we saw before but a good start.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 38.3 121.8 [+3.2x] 172 [+41%] OpenGL does not change matters, CherryTrail is still just over 3x (three times) faster than BayTrail. Here, though, Broadwell is 41% faster still…
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 39.2 109.5 [+2.8x] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 38.11 121.8 [+3.2x] 170 [+39%] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 7.48 18 [+2.4x] Unlike OpenCL driver, DirectX driver does support FP64 – so all GPUs run native FP64 code not emulation. Here, CherryTrail is only 2.4x faster than BayTrail.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 9.17 26 [+2.83x] 46.45 [+78%] As above, the OpenGL driver does support FP64 also – so all GPUs run native FP64 code again. CherryTrail is 2.8x faster here, but Broadwell is 78% faster still.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.3 (emulated) 1.34 [+3%] (emulated) (emulated) Here we’re emulating (mantissa extending) FP128 using FP64 not FP32 but it’s hard: CherryTrail’s performance falls to just 3% faster over BayTrail, perhaps some optimisations are needed.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1 (emulated) 1.1 [+10%] (emulated) 3.4 [+3.1x] (emulated) OpenGL does not change the results – but here we see Broadwell being 3x faster than both CherryTrail and BayTrail. Perhaps such heavy shaders are too much for our Atom GPUs.

Unlike GPGPU, here we don’t see the same crushing improvement – but CherryTrail’s GPU is still about 3x (three times) faster than BayTrail’s – though Broadwell shows its power. Perhaps our shaders are a bit too complex for pixel processing and should rather stay in the GPGPU field…

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Memory Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 6.74 11.18 [+65%] 12.46 [+11%] DirectX bandwidth is not as “bad” as OpenCL on BayTrail (better driver?) so we start from a higher baseline: CherryTrail still manages 65% more bandwidth – with Broadwell only squeezing 11% more despite its dual-channel. It shows that the OpenCL GPGPU driver has come a long way to match DirectX.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.62 2.83 [+8%] 5.29 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again BayTrail does better so CherryTrail can only be 8% faster than it – with Broadwell finally 87% faster.
Video Memory Benchmark Download Bandwidth (GB/s) 1.14 2.1 [+83%] 1.23 [-42%] Here BayTrail “stumbles” so CherryTrail can be 83% faster with Broadwell surprisingly 42% slower. What it does show is that the CherryTrail drivers are better despite being much newer. It is a pity Intel does not provide this driver for BayTrail too…

Again, we see gains in CherryTrail, though smaller than in OpenCL, with bandwidth increasing by 8-83% to help keep all those new EVs fed with data; Broadwell does better on internal and upload bandwidth, as expected of a dual-channel memory controller, but falls behind on download.

Final Thoughts / Conclusions

Here we tested the brand-new Atom X7 Z8700 (CherryTrail) GPU with 16 EVs (EV8) – 4x (four times) more than the older Atom Z3700 (BayTrail) GPU with 4 EVs (EV7) – so we expected big gains – and they were delivered: GPGPU performance is nothing less than stellar, obliterating the old Atom GPU to dust – no doubt also helped by the newer driver (which sadly BayTrail won’t get). And all at the same TDP of about 2.4-5W! Impressive!
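How much of the gain comes from simply having more EVs? A rough estimate, assuming performance scales linearly with unit count and maximum clock (both taken from the specification table above) and that both GPUs actually sustain their maximum clock, can be sketched as follows. The variable names are ours.

```python
# Figures copied from the specification table and the FP32 vectorised OpenCL row.
baytrail = {"units": 4, "max_mhz": 667, "fp32_mpix": 25.0}
cherrytrail = {"units": 16, "max_mhz": 600, "fp32_mpix": 160.0}

# Expected speed-up from raw resources alone: more units, slightly lower clock.
expected = (cherrytrail["units"] / baytrail["units"]) * (
    cherrytrail["max_mhz"] / baytrail["max_mhz"]
)
# Speed-up actually measured in the benchmark.
observed = cherrytrail["fp32_mpix"] / baytrail["fp32_mpix"]
# Whatever is left over must come from elsewhere (architecture, driver, memory).
residual = observed / expected

print(f"expected from units x clock: {expected:.2f}x")
print(f"observed:                    {observed:.2f}x")
print(f"unexplained factor:          {residual:.2f}x")
```

The unexplained factor is what the text attributes to the newer EV8 core and the newer driver; this simple model cannot separate the two.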

Even the very latest Core M (Broadwell) GPU (EV8) is sometimes left behind – at about 2x higher power, more EVs and higher cost – perhaps the new Atom is too good?

Architecturally nothing much has changed (besides the many more EVs) – but we also get better bandwidth and lower latencies – no doubt due to the higher memory bus clock.

All in all, there’s no doubt – this new Atom is the one to get and will bring far better graphics and GPGPU performance at low cost – even overshadowing the great Core M series – no doubt the reason it is found in the latest Surface 3.

To see how the Atom CherryTrail CPU fares, please see CPU Atom Z8700 (CherryTrail) performance article!

Mali T760 GPGPU (Exynos 5433 SoC): FP64 Champion – Adreno Killer?

What is Mali? What is Exynos?

“Mali” is the name of ARM’s own “standard” GPU cores that complement the standard CPU cores (Cortex). Many ARM CPU licensees integrate GPU cores from other vendors in their SoCs, e.g. Imagination, Vivante, Adreno rather than the default Mali.

Mali Series 700 is the 3rd-generation “Midgard” core that complements ARM’s 64-bit ArmV8 Cortex 5X designs; it is thus used in the very latest phones/tablets and has been updated to include support for new technologies like OpenCL ES, OpenGL ES and DirectX.

“Exynos” is the name of Samsung’s line of SoCs that is used in Samsung’s own phones/tablets/TVs. Series 5 is the 5-th generation SoC generally using ARM’s “big.LITTLE” architecture of “small” cores for low-power and “big” cores for performance. 5433 is the 1st 64-bit SoC from Samsung supporting AArch64 aka ArmV8 but running in “legacy” 32-bit ArmV7 mode.

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of various modern phones and tablets that support GPGPU.

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Type / Micro-Arch VLIW4 (Midgard 3rd gen) VLIW5 VLIW5 VLIW5 VLIW4 (Midgard 2nd gen) Scalar (Maxwell 3rd gen) All except K1 are VLIW thus work best with vectorised data; some compilers are very good at vectorising simple code (e.g. by executing multiple data items simultaneously), but the programmer can generally do a better job of extracting parallelism.
Core Speed (MHz) estimated 600 600 578 400 533 ? Core speeds are comparative with latest devices not pushing the clocks too high but instead improving the cores.
OpenGL ES Support 3.1 3.1 3.0 3.0 3.0 (should support 3.1) 3.1 Mali T7xx adds official support for OpenGL ES 3.1 just like the other modern GPU designs: Adreno 400 and K1. While Mali T6xx should also support 3.1 the drivers have not been updated for this “legacy” device.
OpenCL ES Support 1.2 (full) 1.2 (full) 1.1 1.1 1.1 (should support full) Not for Android, supports CUDA Mali T7xx adds support for OpenCL 1.2 but also “full profile” just like Adreno 420 – both supporting all the desktop features of OpenCL – thus any kernels developed for desktop/mobile GPUs can run pretty much unchanged.
CU / SP Units 8 / 256 4 / 128 4 / 128 4 / 64 8 / 64 1 / 192 Mali T760 has 2x the CU of T628 but they should also be far more powerful. Adreno 420 only relies on more powerful CUs over the 330/320; nVidia uses only 1 SMX/CU but more SPs.
Global Memory (MB) 2048 of 3072 1400 of 3072 1400 of 3072 960 of 2048 1024 of 3072 n/a Modern phones with 3GB memory seem to allow about 50% to be allocated through OpenCL. Mali does generally seem to allow more, typically 66%.
Largest Memory Block (MB) 512 of 2048 347 of 1400 347 of 1400 227 of 960 694 of 1024 n/a The maximum block size seems to be about 25% of total memory, but Mali’s driver allows as much as 50%.
Constant Memory (kB) 64 64 4 4 64 n/a Mali T600 was already fine here, with Adreno 400 needed to catch up to the rest. Previously constant data would have needed to be kept in normal global memory due to the small constant memory size.
Shared Memory (kB) 32 32 8 8 32 n/a Again Mali T600 was fine already – with Adreno 400 finally matching the rest.
Max. Workgroup Size 256 x 256 x 256 1024 x 1024 x 1024 512 x 512 x 512 256 x 256 x 256 256 x 256 x 256 n/a Surprisingly the work-group size remains at 256 for Mali T700/T600 with Adreno 400 pushing all the way to 1024. That does not necessarily mean it is the optimum size.
Cache (Reported) kB 256 128 32 32 n/a n/a Here Mali T760 overtakes them all with a 256kB L2 cache, 2x bigger than Adreno 400 and older Mali T600.
FP16 / FP64 Support Yes / Yes Yes / No Yes / No Yes / No No No Here we have the 1st mobile GPU with native FP64 support! If you have double floating-point workloads then stop reading now and get a SoC with Mali T700 series.
Byte/Integer Width 16 / 4 1 / 1 1 / 1 1 / 1 16 / 4 n/a Adreno prefers non-vectorised integer data even though it is VLIW5; only Mali prefers vectorised data (vec4) similar to the old ATI/AMD pre-GCN hardware. At least all our vectorisations are not in vain 😉
Float/Double Width 4 / 2 1 / n/a 1 / n/a 1 / n/a 4 / n/a n/a As before, Adreno prefers non-vectorised while Mali vectorised data. As Mali T760 supports FP64, it also wants vectorised double floating-point data.

GPGPU Compute Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
GPGPU Arithmetic Benchmark Half-float/FP16 Vectorised OpenCL (Mpix/s) 199.9 184.4 73.4 20.2
GPGPU Arithmetic Benchmark Single-Float/FP32 Vectorised OpenCL (Mpix/s) 105.9 182 [+71%] 114.8 49.1 20.2 Adreno 420 manages to beat Mali T760 by a good ~70% even though we use a highly vectorised kernel – not what we expected. But this is 5x faster than the old Mali T628 showing just how much Mali has improved since its last version – but not enough!
GPGPU Arithmetic Benchmark Double-float/FP64 Vectorised OpenCL (Mpix/s) 30.6 [+3x] 10.1 (emulated) 8.4 (emulated) 3.4 (emulated) 0.518 (emulated) Here you see the power of native support, Mali T760 blows everything out the water – it is 3x faster than the Adreno 400 and a crazy 60x (sixty times) faster than the old Mali T628! This is really the GPGPU to beat – nVidia must be regretting not adding FP64 to K1.
GPGPU Arithmetic Benchmark Quad-float/FP128 Vectorised OpenCL (Mpix/s) 0.995 [+5x] (emulated using FP64) 0.197 (emulated) 0.056 (emulated) 0.053 (emulated) failed Emulating FP128 using FP64 gives Mali T760 a big advantage – now it is 5x faster than Adreno 400! There is no question – if you have high precision computation to do, Mali T700 is your GPGPU.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 136 145 [+6%] 96 70 85 T760 is just a bit (6%) slower than Adreno 420 here, but that is still a good improvement (1.6x) over its older brother T628.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 200 131 94 89
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 321 [+2%] 314 141 106 89 In this integer compute-heavy workload, Mali T760 just edges Adreno 420 by 2% – pretty much a tie. Both GPGPUs are competitive in integer workloads as we’ve seen in the AES tests also.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 948 442 271 294
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 212.9 235.4 [+10%] 170.7 85 98.3 Black-Scholes is not compute heavy allowing many GPUs to do well, and here Adreno 420 is 10% faster than Mali T760.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 0.842 6.605 [+7x] 4.737 1.477 0.797 Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and Adreno 420 does not disappoint; however Mali T760 (just as T628) is not very happy with our code with a pitiful score that is 1/7x (seven times slower).
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 34 59.5 [+1.75x] 19.1 14.2 10.4 Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Adreno 420 is 1.75x (almost two times) faster. It could be that Mali stumbles at the shared memory operations which are key to both algorithms.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 5.167 6.179 [+19%] 3.173 2.992 2.362 Adreno 420 continues its dominance here, being ~20% faster than Mali T760 but nowhere near the lead it had in Financial tests.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 1.902 5.470 [+2.87x] 3.535 2.146 1.914 FFT involves a lot of memory accesses but here Adreno 420 is almost 3x faster than Mali T760, a lead similar to what we saw in the complex Financial (Binomial/Monte-Carlo) tests.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 14.3 27.7 [+2x] 23.9 15.9 9.46 N-Body generally allows GPUs to “spread their wings” and here Adreno 420 does not disappoint – it is 2x faster than Mali T760.
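The Black-Scholes row in the table above is described as not compute heavy. As a rough illustration of why, here is a minimal sketch of the standard closed-form price of a European call option. It is not the benchmark's actual kernel (which is not shown in this article), and the function and parameter names are our own assumptions: each option needs only a handful of transcendental operations and no communication between work-items.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, years):
    """Closed-form price of a European call option (no dividends)."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * years) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Hypothetical option: at-the-money, 5% rate, 20% volatility, 1 year to expiry.
price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"call price: {price:.4f}")
```

Every option is priced independently, so a GPU can map one option to one work-item without touching shared memory, unlike the Binomial and Monte-Carlo tests.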

It seems our early enthusiasm over FP64 native support was quickly extinguished: while Mali T760 naturally does well in FP64 tests – it cannot beat its rival Adreno (420) in other tests.

In simple single-precision floating-point (FP32) workloads, Adreno is only about 10-20% faster; however in complex workloads (Binomial, Monte-Carlo, FFT, GEMM) Adreno can be 2-7x (times) faster – a huge lead. This seems to be down to shared memory accesses rather than the VLIW design needing highly-vectorised kernels, which is what we’re using.
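The leads quoted here can be re-derived from the FP32 rows of the compute table. The sketch below simply divides the Adreno 420 score by the Mali T760 score for each test; the scores are copied from the table, while the dictionary layout and function name are our own.

```python
# (Mali T760, Adreno 420) scores from the FP32 rows above; higher is better.
SCORES = {
    "Arithmetic FP32 (Mpix/s)": (105.9, 182.0),
    "Black-Scholes (MOPT/s)": (212.9, 235.4),
    "Binomial (kOPT/s)": (0.842, 6.605),
    "Monte-Carlo (kOPT/s)": (34.0, 59.5),
    "SGEMM (GFLOPS)": (5.167, 6.179),
    "SFFT (GFLOPS)": (1.902, 5.470),
    "N-Body (GFLOPS)": (14.3, 27.7),
}

def adreno_lead(mali, adreno):
    """Return how many times faster Adreno 420 is than Mali T760."""
    return adreno / mali

LEADS = {name: adreno_lead(*pair) for name, pair in SCORES.items()}

for name, lead in LEADS.items():
    print(f"{name:26s} {lead:5.2f}x")
```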

Naturally in double-precision floating-point (FP64) workloads, Mali T760 flies – being 3-5x (times) faster, so if those are the kinds of workloads you require – it is the natural choice. However, such precision is uncommon on phones/tablets – even desktop/laptop GPGPUs have crippled FP64 performance.
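The article does not describe how FP128 is emulated on top of FP64. One common approach, assumed here purely for illustration, is “double-double” arithmetic, where a value is kept as an unevaluated sum of two FP64 numbers. The minimal sketch below shows only addition, built on Knuth's error-free two-sum; the function names are ours.

```python
from fractions import Fraction

def two_sum(a, b):
    """Error-free addition: returns (s, e) with s + e == a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-double values, each given as a (hi, lo) pair of floats."""
    s, e = two_sum(x[0], y[0])   # exact sum of the high parts
    e += x[1] + y[1]             # fold in the low parts
    return two_sum(s, e)         # renormalise so that |lo| is tiny next to hi

one = (1.0, 0.0)
tiny = (1e-20, 0.0)
hi, lo = dd_add(one, tiny)
print(1.0 + 1e-20 == 1.0)        # plain FP64 loses the small term
print(hi, lo)                    # the double-double keeps it in the low part
```

Each emulated operation costs several native FP64 operations, which is consistent with native FP64 hardware having a large advantage in the FP128 test, although the real kernels may work differently.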

In integer workloads, the two GPGPUs are competitive with a 2-6% difference either way.

The relatively small (256) workgroup size may also hamper performance with Adreno 420 able to keep more (1024) threads in flight – although the shared cache size is the same.

GPGPU Memory Performance

We are testing memory bandwidth performance of GPUs using OpenCL, including transfer (up/down) to/from system memory; we also measure the latencies of the various memory types (global, constant, shared, etc.) using different access patterns (in-page random access, sequential access, etc.).
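Latency tests of this kind are typically built as a “pointer chase”, where each load depends on the result of the previous one so the hardware cannot overlap them. The sketch below is our own illustration of how the three access patterns could be generated, not the benchmark's actual code; the `page` size (in elements) and all names are assumptions.

```python
import random

def build_chain(n, pattern, page=64, seed=0):
    """Return a table where chain[i] is the index visited after index i."""
    rng = random.Random(seed)
    order = list(range(n))
    if pattern == "full_random":
        rng.shuffle(order)                      # jump anywhere in the buffer
    elif pattern == "in_page_random":
        for start in range(0, n, page):         # shuffle only inside each page
            block = order[start:start + page]
            rng.shuffle(block)
            order[start:start + page] = block
    elif pattern != "sequential":
        raise ValueError(f"unknown pattern: {pattern}")
    chain = [0] * n
    for pos, idx in enumerate(order):
        chain[idx] = order[(pos + 1) % n]       # link into a single cycle
    return chain

def walk(chain, start=0):
    """Follow the chain once around and return the number of dependent loads."""
    visited, i = 0, start
    while True:
        i = chain[i]
        visited += 1
        if i == start:
            return visited

for pattern in ("sequential", "in_page_random", "full_random"):
    print(pattern, walk(build_chain(1024, pattern)))
```

On a real device the walk would run inside a kernel and the elapsed time would be divided by the number of loads to obtain the latency per access.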

Results Interpretation (Bandwidth): Higher values (MPix/s, MB/s, etc.) mean better performance.

Results Interpretation (Latency): Lower values (ns, clocks, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
Memory Configuration 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 2GB DDR3 (shared with CPU) 3GB DDR2 (shared with CPU) Modern phones and tablets now ship with 3GB – close to the 32-bit address limit. While all SoCs support unified memory, none seems to support “zero copy” or HSA, which has recently made it to OpenCL on the desktop with version 2.0. Future SoCs will fix this issue and provide global virtual memory address space that CPU & GPU can share.
GPGPU Memory Bandwidth Internal Memory Bandwidth (MB/s) 4528 9457 [+2x] 8383 4751 1436 Qualcomm’s SoC manages to extract almost 2x more bandwidth compared to Samsung’s SoC – which may explain some of the performance issues we saw when processing large amounts of data. Adreno has almost 10GB/s bandwidth to play with, similar to single-channel desktop/laptops!
GPGPU Memory Bandwidth Upload Bandwidth (MB/s) 2095 3558 [+69%] 3294 2591 601 Adreno wins again with 70% more upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (MB/s) 2091 3689 [+76%] 2990 2412 691 Again Adreno has over 70% more download bandwidth – it is no wonder it did so well in the compute tests. Mali will have to improve markedly to match.
GPGPU Memory Latency Global Memory Latency – In Page Random Access (ns) 199.3 [-17%] 239.6 388.2 658.8 625.8 It starts well for Mali T760, with ~20% lower response time over Adreno 420 and almost 3x faster than its old T628 brother – which no doubt helps the performance of many algorithms that access global memory randomly. All modern SoCs (805, 5433, 801) show just how much improvement was made in the memory prefetchers in the last few years.
GPGPU Memory Latency Global Memory Latency – Full Random Access (ns) 346.5 [-30%] 493.2 500.2 933.7 815.2 Full random access is tough on all GPUs, but here Mali T760 manages to be 30% faster than Adreno 420, which has not improved over 330 (same memory controller in 800 series).
GPGPU Memory Latency Global Memory Latency – Sequential Access (ns) 147.2 106.6 [-27%] 98.2 116.2 280.6 With sequential accesses – we finally see Adreno 420 (but also the older 300 series) show their prowess being 27% faster. Qualcomm’s memory prefetchers seem to be doing their job here.
GPGPU Memory Latency Constant Memory (ns) 263.1 70.7 [-1/3.75x] 74.5 103 343 Here Adreno 420’s constant memory has almost 4x lower latency than Mali (thus 1/4 response time) – which may be a clue as to why it is so much faster. Basically it does not seem that constant memory is cached on the Mali but just normal global memory.
GPGPU Memory Latency Shared Memory (ns) 301 30.1 [-1/10x] 47 83 329 Shared memory is even more important as it is used to share data between threads – lack of it reduces the work-group sizes that can be used. Here we see Adreno 420’s shared memory having 10x lower latency than Mali (thus 1/10x response time) – no wonder Mali T760 is so much slower in complex workloads that make extensive use of shared memory. Basically shared memory is not *dedicated* but just normal global memory.

Memory testing seems to reveal Mali T760’s problem: its bandwidth is much lower than Adreno’s while the latencies of its key memories (shared, constant) are far higher. It is actually a wonder that it performs so well if the numbers are to be believed – but since Mali T628 scores similarly there is no reason to doubt them.

Adreno 420 has 2x higher internal bandwidth and roughly 70% more upload/download bandwidth – and since neither supports HSA and thus “zero copy” – its advantage grows with the size of the memory blocks used. Here, Qualcomm’s completely-designed SoC (CPU, GPU, memory controller) pays dividends.
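As a back-of-the-envelope illustration, the cost of copying a buffer to the GPU can be estimated as size divided by bandwidth. The sketch below uses the upload figures from the table; it deliberately ignores any fixed per-transfer overhead, which is an assumption on our part, and the names are ours.

```python
# Upload bandwidth in MB/s, taken from the memory bandwidth table above.
UPLOAD_MBPS = {"Mali T760": 2095, "Adreno 420": 3558}

def transfer_ms(size_mb, bandwidth_mbps):
    """Estimated transfer time in milliseconds, ignoring fixed overheads."""
    return 1000.0 * size_mb / bandwidth_mbps

for size_mb in (1, 64, 512):
    mali = transfer_ms(size_mb, UPLOAD_MBPS["Mali T760"])
    adreno = transfer_ms(size_mb, UPLOAD_MBPS["Adreno 420"])
    print(f"{size_mb:4d} MB: Mali {mali:7.2f} ms, Adreno {adreno:7.2f} ms, "
          f"gap {mali - adreno:6.2f} ms")
```

The ratio between the two devices stays constant, but the absolute gap in milliseconds grows linearly with the block size.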

Mali T760’s global memory latency is lower but neither constant nor (more crucially) shared memory seems to be treated differently and thus both have similar latencies to global memory; common GPGPU optimisations are thus useless and any complex algorithm making extensive use of shared memory will be greatly bogged down. ARM would do well to rethink their approach for the new (T800) Mali series.

Video Shader Performance

We are testing vectorised shader compute performance of the GPUs in OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Video Shader Benchmark Single-Float/FP32 OpenGL ES (Mpix/s) 49 170.5 [+3.5x] 114.4 60 33.6 124.7 Finally the K1 can play and does very well but cannot overtake Adreno 420 which also blows the Mali T760 out of the water being over 3.5x faster. We can see just how much shader performance has improved in a few years.
Video Shader Benchmark Half-float/FP16 OpenGL ES (Mpix/s) 54 219.6 [+4x] 115 107.6 32.2 124.4 While Mali T760 finally supports FP16, it does not seem to do much good over FP32 (+10% faster) while Adreno 420 benefits greatly – thus increases its lead to being 4x faster. OpenGL is still not Mali’s forte.
Video Shader Benchmark Double-float/FP64 OpenGL ES (Mpix/s) 2.3 [1/21x] (emulated) 10.6 [+5x] [1/17x] (emulated) 9.4 [1/12x] (emulated) 4.0 [1/15x] (emulated) 2.1 [1/16x] (emulated) 26.0 [1/4.8x] (emulated) While Mali T760 does support FP64, the OpenGL extension is not yet supported/enabled – thus it is forced to run it emulated in FP32. This allows Adreno 420 to be 5x faster – though nVidia’s K1 takes the win.
Video Shader Benchmark Quad-float/FP128 OpenGL ES (Mpix/s) n/a n/a n/a n/a n/a n/a Emulating FP128 using FP32 is too much for our GPUs, we shall have to wait for the new generation of mobile GPUs.

Using OpenGL ES allows the K1 to play, but more specifically it shows Mali’s OpenGL prowess is lacking – Adreno 420 is 3.5-5x faster – a big difference. FP16 support seems to make no difference while FP64 support is missing in OpenGL thus it cannot play its Ace card. ARM has some OpenGL driver optimisation to make.
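For readers unfamiliar with FP16, the sketch below gives a feel for its precision using Python's `struct` module, which supports the IEEE half-precision format via the `e` format code. The scenario (an 8-bit colour channel normalised to the 0..1 range) is our own assumption about a typical shader workload, not something measured in this article.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def pixel_error(value_8bit):
    """Round-trip an 8-bit channel value (0..255) through FP16 in the 0..1 range."""
    normalised = value_8bit / 255.0
    restored = round(to_fp16(normalised) * 255.0)
    return restored - value_8bit

worst = max(abs(pixel_error(v)) for v in range(256))
print("worst round-trip error in 8-bit steps:", worst)
print("FP16 value nearest to 0.1:", to_fp16(0.1))
```

A single conversion is lossless at 8-bit granularity; visible differences, if any, would come from rounding errors accumulating over many arithmetic operations inside a filter.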

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mali T760 is a big upgrade over its older T600 series though a lot of the details have not changed. However, it is not enough to beat its rival Adreno 400 series – with native FP64 performance (and thus FP128 emulated) being the only shining example. While its integer workload performance is competitive – floating-point performance in complex workloads (making extensive use of shared memory) is much lower. Even highly vectorised kernels that should help its VLIW design cannot close the gap.

It seems the SoC’s memory controller lets it down, and its non-dedicated shared and constant memory means high latencies slow it down. ARM should really implement dedicated shared memory in the next major version.

Elsewhere its OpenGL shader compute performance is even slower (1/4x Adreno) with FP16 support not helping much and FP64 native support missing. This is a surprise considering its far more competitive OpenCL performance. Hopefully future drivers will address this – but considering the T600 performance has remained pretty much unchanged we’re not hopeful.

To see how the Exynos 5433 CPU fares, please see Exynos 5433 CPU (Cortex A57+A53) performance article!

SiSoftware Releases Support for DirectX 11 Compute Shader/DirectCompute

FOR IMMEDIATE RELEASE

Contact: Press Office

London, UK, 30th November 2009 – SiSoftware releases its suite of DirectX 11 Compute Shader/DirectCompute GPGPU (General Purpose Graphics Processor Unit) benchmarks as part of SiSoftware Sandra 2010, the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, and networks.

At SiSoftware we are constantly looking out for new technologies with the aim to understand how those technologies can best be benchmarked and analysed. We believe that the industry is seeing a shift from the model where heavy computational workload is processed on a traditional CPU to a model that uses the GPGPU or a combination of GPU and CPU; in a wide range of applications developers are using the power of GPGPU to aid business analysis, games, graphics and scientific applications.

As certain tasks or workloads may still perform better on traditional CPU, we see both CPU and GPGPU benchmarks to be an important part of performance analysis. Having launched the GPGPU Benchmarks with SiSoftware Sandra 2009 with support for AMD CTM/STREAM and nVidia CUDA, we have now ported the benchmark suite to DirectX 11 Compute Shader/DirectCompute.

Compute Shader/DirectCompute is a new programmable shader stage introduced with DirectX 11 that expands Direct3D beyond graphics programming. Windows programmers familiar with DirectX can now use high-speed general purpose computing and take advantage of the large numbers of parallel processors on GPUs. We believe Compute Shader/DirectCompute will become “the standard” for programming parallel workloads in Windows, thus we have ported all our GPGPU benchmarks to DirectX 11 Compute Shader/DirectCompute.

Below is a quote we would like to share with you:

“As ATI Stream technology grows in popularity with software developers, SiSoftware’s Sandra 2010 benchmark is an increasingly important tool for evaluating GPU compute performance,” said Eric Demers, chief technology officer, graphics products, AMD. “As the only provider of DirectX 11 GPUs today, AMD welcomes SiSoftware’s support for that popular application programming interface in Sandra 2010.”

The SiSoftware DirectX 11 Compute Shader/DirectCompute Benchmarks look at the two major performance aspects:

  • Computational performance: in simple terms how fast it can crunch numbers. It follows the same style as the CPU Multi-Media benchmark using fractal generation as its workload. This allows the user to see the power of the GPGPU in solving a workload thus far exclusively performed on a CPU.
  • Memory performance: this analyses how fast data can be transferred to and from the GPGPU. No matter how fast the processing, ultimately the end result will be affected by memory performance.
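The release does not say which fractal is generated. Purely as an illustration, and assuming the classic Mandelbrot set, the per-pixel work of such a workload could look like the sketch below; the function names, image size and iteration limit are hypothetical.

```python
def mandelbrot_iterations(cx, cy, max_iter=64):
    """Count iterations of z = z*z + c before |z| exceeds 2 (or max_iter)."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter

def render(width=32, height=16, max_iter=64):
    """Evaluate every pixel independently, as a GPU would do in parallel."""
    rows = []
    for py in range(height):
        cy = -1.0 + 2.0 * py / (height - 1)
        rows.append([mandelbrot_iterations(-2.0 + 3.0 * px / (width - 1), cy, max_iter)
                     for px in range(width)])
    return rows

image = render()
print(len(image), "rows x", len(image[0]), "columns")
```

Each pixel depends only on its own coordinates, which is why this kind of workload scales well across the many parallel processors of a GPU.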

Key features

  • 4 architectures natively supported (x86, x64/AMD64/EM64T, IA64/Itanium2, ARM)
  • 6 languages supported (English, French³, German³, Italian³, Japanese³, Russian³)
  • DirectX 11 Compute Shader/DirectCompute
  • Different models of GPUs supported, including integrated GPU + dedicated GPUs.
  • Multi-GPUs supported, up to 8 in parallel.

With each release, we continue to add support and compatibility for the latest technologies. SiSoftware works with hardware vendors to ensure the best support for new emerging hardware.

Notes:

1 Available as Beta at this time, performance cannot be guaranteed.

2 By special arrangement; Enterprise versions only.

3 Not all languages available at publication, will be released later.

Relevant Press Releases

Relevant Articles

For more details, please see the following articles comparing current devices on the market:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please contact your distributor (sales support).

Downloading

For more details, and to download the Lite version, please click here.

Reviewers and Editors

For your free review copies, please contact us.
About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Nearly 700 worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Over 9,000 on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, DirectCompute, x64, ARM, MIPS, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX3, AVX2, AVX, FMA4, FMA, NEON, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit http://www.sisoftware.net, http://www.sisoftware.eu, http://www.sisoftware.info or http://www.sisoftware.co.uk