Mali T760 GPGPU (Exynos 5433 SoC): FP64 Champion – Adreno Killer?

What is Mali? What is Exynos?

“Mali” is the name of ARM’s own “standard” GPU cores that complement the standard CPU cores (Cortex). Many ARM CPU licensees integrate GPU cores from other vendors in their SoCs (e.g. Imagination, Vivante, Adreno) rather than the default Mali.

Mali Series 700 is the 3rd-generation “Midgard” core that complements ARM’s 64-bit ARMv8 Cortex A5x designs and is thus used in the very latest phones/tablets; it has been updated to support newer technologies such as OpenCL 1.2 (full profile), OpenGL ES 3.1 and DirectX.

“Exynos” is the name of Samsung’s line of SoCs that is used in Samsung’s own phones/tablets/TVs. Series 5 is the 5th-generation SoC, generally using ARM’s “big.LITTLE” architecture of “small” cores for low power and “big” cores for performance. The 5433 is Samsung’s first 64-bit SoC, supporting AArch64 (aka ARMv8), but here it runs in “legacy” 32-bit ARMv7 mode.

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of various modern phones and tablets that support GPGPU.

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Type / Micro-Arch VLIW4 (Midgard 3rd gen) VLIW5 VLIW5 VLIW5 VLIW4 (Midgard 2nd gen) Scalar (Kepler) All except K1 are VLIW and thus work best with vectorised data; some compilers are very good at vectorising simple code (e.g. by processing multiple data items simultaneously), but the programmer can generally do a better job of extracting parallelism.
Core Speed (MHz) estimated 600 600 578 400 533 ? Core speeds are comparative with latest devices not pushing the clocks too high but instead improving the cores.
OpenGL ES Support 3.1 3.1 3.0 3.0 3.0 (should support 3.1) 3.1 Mali T7xx adds official support for OpenGL ES 3.1 just like the other modern GPU designs: Adreno 400 and K1. While Mali T6xx should also support 3.1, the drivers have not been updated for this “legacy” device.
OpenCL ES Support 1.2 (full) 1.2 (full) 1.1 1.1 1.1 (should support full) Not for Android, supports CUDA Mali T7xx adds support for OpenCL 1.2 but also “full profile” just like Adreno 420 – both supporting all the desktop features of OpenCL – thus any kernels developed for desktop/mobile GPUs can run pretty much unchanged.
CU / SP Units 8 / 256 4 / 128 4 / 128 4 / 64 8 / 64 1 / 192 Mali T760 has 2x the CU of T628 but they should also be far more powerful. Adreno 420 only relies on more powerful CUs over the 330/320; nVidia uses only 1 SMX/CU but more SPs.
Global Memory (MB) 2048 of 3072 1400 of 3072 1400 of 3072 960 of 2048 1024 of 3072 n/a Modern phones with 3GB memory seem to allow about 50% to be allocated through OpenCL. Mali does generally seem to allow more, typically 66%.
Largest Memory Block (MB) 512 of 2048 347 of 1400 347 of 1400 227 of 960 694 of 1024 n/a The maximum block size seems to be about 25% of total memory, but Mali’s driver allows as much as 50%.
Constant Memory (kB) 64 64 4 4 64 n/a Mali T600 was already fine here, with Adreno 400 needing to catch up to the rest. On the older Adreno parts, constant data would have needed to be kept in normal global memory due to the small (4kB) constant memory size.
Shared Memory (kB) 32 32 8 8 32 n/a Again Mali T600 was fine already – with Adreno 400 finally matching the rest.
Max. Workgroup Size 256 x 256 x 256 1024 x 1024 x 1024 512 x 512 x 512 256 x 256 x 256 256 x 256 x 256 n/a Surprisingly the work-group size remains at 256 for Mali T700/T600 with Adreno 400 pushing all the way to 1024. That does not necessarily mean it is the optimum size.
Cache (Reported) kB 256 128 32 32 n/a n/a Here Mali T760 overtakes them all with a 256kB L2 cache, 2x bigger than Adreno 400 and older Mali T600.
FP16 / FP64 Support Yes / Yes Yes / No Yes / No Yes / No No No Here we have the first mobile GPU with native FP64 support! If you have double-precision floating-point workloads then stop reading now and get a SoC with a Mali T700 series GPU.
Byte/Integer Width 16 / 4 1 / 1 1 / 1 1 / 1 16 / 4 n/a Adreno prefers non-vectorised integer data even though it is VLIW5; only Mali prefers vectorised data (vec4) similar to the old ATI/AMD pre-GCN hardware. At least all our vectorisations are not in vain 😉
Float/Double Width 4 / 2 1 / n/a 1 / n/a 1 / n/a 4 / n/a n/a As before, Adreno prefers non-vectorised while Mali vectorised data. As Mali T760 supports FP64, it also wants vectorised double floating-point data.
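
As a quick sanity check on the two memory rows above, the short Python sketch below recomputes the shares from the table's own figures. The dictionary and function names are ours, purely for illustration; only the numbers come from the table.

```python
# (allocatable MB, total MB) as reported through OpenCL in the table above.
GLOBAL_MEMORY_MB = {
    "Mali T760": (2048, 3072),
    "Adreno 420": (1400, 3072),
    "Adreno 330": (1400, 3072),
    "Adreno 320": (960, 2048),
    "Mali T628": (1024, 3072),
}

# (largest single block MB, allocatable MB) from the same table.
LARGEST_BLOCK_MB = {
    "Mali T760": (512, 2048),
    "Adreno 420": (347, 1400),
    "Adreno 330": (347, 1400),
    "Adreno 320": (227, 960),
    "Mali T628": (694, 1024),
}


def share_percent(part_mb, whole_mb):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part_mb / whole_mb, 1)


def shares(table):
    """Map each GPU name to its part/whole percentage."""
    return {gpu: share_percent(part, whole) for gpu, (part, whole) in table.items()}


global_share = shares(GLOBAL_MEMORY_MB)
block_share = shares(LARGEST_BLOCK_MB)

for gpu in GLOBAL_MEMORY_MB:
    print(f"{gpu:<11} global {global_share[gpu]:5.1f}%  largest block {block_share[gpu]:5.1f}%")
```

On these figures the Mali T760 exposes about two thirds of system memory, the Adrenos a little under half, and the largest block is close to a quarter of the allocatable memory everywhere except on the Mali T628, which is the outlier.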

GPGPU Compute Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
5433 GPGPU Arithmetic
GPGPU Arithmetic Benchmark Half-float/FP16 Vectorised OpenCL (Mpix/s) 199.9 184.4 73.4 20.2
GPGPU Arithmetic Benchmark Single-Float/FP32 Vectorised OpenCL (Mpix/s) 105.9 182 [+71%] 114.8 49.1 20.2 Adreno 420 manages to beat Mali T760 by a good ~70% even though we use a highly vectorised kernel – not what we expected. Still, T760 is 5x faster than the old Mali T628, showing just how much Mali has improved since its last version – but not enough!
GPGPU Arithmetic Benchmark Double-float/FP64 Vectorised OpenCL (Mpix/s) 30.6 [+3x] 10.1 (emulated) 8.4 (emulated) 3.4 (emulated) 0.518 (emulated) Here you see the power of native support: Mali T760 blows everything out of the water – it is 3x faster than the Adreno 420 and a crazy 60x (sixty times) faster than the old Mali T628! This is really the GPGPU to beat – nVidia must be regretting not adding FP64 to K1.
GPGPU Arithmetic Benchmark Quad-float/FP128 Vectorised OpenCL (Mpix/s) 0.995 [+5x] (emulated using FP64) 0.197 (emulated) 0.056 (emulated) 0.053 (emulated) failed Emulating FP128 using FP64 gives Mali T760 a big advantage – now it is 5x faster than Adreno 420! There is no question – if you have high-precision computation to do, Mali T700 is your GPGPU.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 136 145 [+6%] 96 70 85 T760 is just a bit (6%) slower than Adreno 420 here, but still a good improvement (1.6x) over its older brother T628.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 200 131 94 89
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 321 [+2%] 314 141 106 89 In this integer compute-heavy workload, Mali T760 just edges Adreno 420 by 2% – pretty much a tie. Both GPGPUs are competitive in integer workloads as we’ve seen in the AES tests also.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 948 442 271 294
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 212.9 235.4 [+10%] 170.7 85 98.3 Black-Scholes is not compute heavy allowing many GPUs to do well, and here Adreno 420 is 10% faster than Mali T760.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 0.842 6.605 [+7x] 4.737 1.477 0.797 Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and Adreno 420 does not disappoint; however Mali T760 (just like T628) is not very happy with our code, with a pitiful score that is over seven times slower.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 34 59.5 [+1.75x] 19.1 14.2 10.4 Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Adreno 420 is 1.75x (almost two times) faster. It could be that Mali stumbles at the shared memory operations which are key to both algorithms.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 5.167 6.179 [+19%] 3.173 2.992 2.362 Adreno 420 continues its dominance here, being ~20% faster than Mali T760 but nowhere near the lead it had in Financial tests.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 1.902 5.470 [+2.87x] 3.535 2.146 1.914 FFT involves a lot of memory accesses but here Adreno 420 is almost 3x faster than Mali T760, a lead similar to what we saw in the complex Financial (Binomial/Monte-Carlo) tests.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 14.3 27.7 [+2x] 23.9 15.9 9.46 N-Body generally allows GPUs to “spread their wings” and here Adreno 420 does not disappoint – it is 2x faster than Mali T760.

It seems our early enthusiasm over FP64 native support was quickly extinguished: while Mali T760 naturally does well in FP64 tests – it cannot beat its rival Adreno (420) in other tests.

In simple single-precision floating-point (FP32) workloads, Adreno is only about 10-20% faster; however, in complex workloads (Binomial, Monte-Carlo, FFT, N-Body) Adreno can be 2-7x faster, a huge lead. This seems to be down to shared memory accesses rather than the VLIW design needing highly vectorised kernels, which is what we are using.
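
To illustrate what a shared-memory reduction involves, here is a minimal Python sketch that simulates, sequentially on the CPU, the tree reduction a work-group typically performs in local (shared) memory. It is not our benchmark kernel; the function name and the 256-item example are assumptions for illustration only.

```python
def workgroup_reduce(local_mem):
    """Tree-reduce a power-of-two sized sequence the way a work-group would.

    Returns (total, steps). Each step stands for one barrier-separated
    round of local (shared) memory reads and writes.
    """
    data = list(local_mem)
    size = len(data)
    if size == 0 or size & (size - 1):
        raise ValueError("work-group size must be a power of two")

    steps = 0
    stride = size // 2
    while stride > 0:
        # On a GPU every work-item with id < stride does this in parallel;
        # a barrier is then needed before the next, smaller stride.
        for item in range(stride):
            data[item] += data[item + stride]
        stride //= 2
        steps += 1
    return data[0], steps


# Hypothetical example: one value per work-item in a 256-item work-group.
total, steps = workgroup_reduce(range(256))
print(f"sum = {total}, dependent steps = {steps}")
```

For 256 work-items this takes 8 dependent steps, and each step reads what the previous one wrote. Whatever the local-memory latency is, it is therefore paid several times in sequence per reduction, which is why slow shared memory would hurt this kind of algorithm in particular.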

Naturally in double-precision floating-point (FP64) workloads, Mali T760 flies – being 3-5x (times) faster, so if those are the kinds of workloads you require – it is the natural choice. However, such precision is uncommon on phones/tablets – even desktop/laptop GPGPUs have crippled FP64 performance.
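
For readers wondering how a higher precision can be emulated on top of FP64, the Python sketch below shows "double-double" addition, one common technique in which a value is held as the unevaluated sum of two doubles. This is an assumption about the general approach, not the exact method of our FP128 kernels, and it yields roughly twice the FP64 mantissa rather than true IEEE quad precision. Python floats are FP64, so the arithmetic matches.

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    # Renormalise so that lo stays tiny relative to hi.
    hi = s + e
    lo = e - (hi - s)
    return hi, lo


# Plain FP64 loses the small term entirely...
plain = (1.0 + 1e-20) - 1.0

# ...while the double-double pair keeps it in the low word.
one_plus_tiny = dd_add((1.0, 0.0), (1e-20, 0.0))
recovered = dd_add(one_plus_tiny, (-1.0, 0.0))

print("FP64:         ", plain)
print("double-double:", recovered[0])
```

Each emulated addition costs several native FP64 operations, which is consistent with emulated results being far lower than native ones, and with a GPU that has native FP64 having a large head start when the next precision up is built on top of it.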

In integer workloads, the two GPGPUs are competitive with a 2-6% difference either way.
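
The bracketed leads in the tables can be reproduced from the raw scores. The helper below is a small Python sketch of that arithmetic; the function name is ours, and truncating (rather than rounding) is an assumption inferred from the published figures, e.g. 182 vs 105.9 being shown as +71%.

```python
import math


def lead(winner, loser):
    """Return (percent_gain, multiple) of winner over loser, higher-is-better.

    Both values are truncated, not rounded: percent_gain to a whole percent,
    multiple to two decimals.
    """
    ratio = winner / loser
    percent_gain = math.floor((ratio - 1.0) * 100.0)
    multiple = math.floor(ratio * 100.0) / 100.0
    return percent_gain, multiple


# Scores taken from the compute table above (Adreno 420 vs Mali T760).
print("FP32 arithmetic:", lead(182, 105.9))    # Adreno 420 ahead
print("FP64 arithmetic:", lead(30.6, 10.1))    # Mali T760 ahead
print("AES256:         ", lead(145, 136))      # Adreno 420 ahead
print("SHA2-256:       ", lead(321, 314))      # Mali T760 ahead
print("SFFT:           ", lead(5.470, 1.902))  # Adreno 420 ahead
```

Small leads read more naturally as a percentage and large ones as a multiple, which is why the tables mix the two forms.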

The relatively small (256) work-group size may also hamper performance, with Adreno 420 able to keep more (1024) threads in flight, although the shared memory size is the same (32kB).

GPGPU Memory Performance

We are testing memory bandwidth performance of GPUs using OpenCL, including transfer (up/down) to/from system memory; we also measure the latencies of the various memory types (global, constant, shared, etc.) using different access patterns (in-page random access, sequential access, etc.).
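
To make the access patterns concrete, the Python sketch below builds the index chains a pointer-chasing latency test could follow: every load yields the address of the next one, so loads cannot overlap and the time per hop approximates latency. Pointer chasing is a common technique and an assumption here, not a description of our kernels; `build_chain`, `walk` and the `page_slots` parameter are hypothetical names.

```python
import random


def build_chain(n_slots, pattern, page_slots=1024, seed=0):
    """Return next_index such that following it visits every slot exactly once."""
    rng = random.Random(seed)
    if pattern == "sequential":
        order = list(range(n_slots))
    elif pattern == "full_random":
        order = list(range(n_slots))
        rng.shuffle(order)
    elif pattern == "in_page_random":
        # Shuffle inside each page, but visit the pages one after another.
        order = []
        for start in range(0, n_slots, page_slots):
            page = list(range(start, min(start + page_slots, n_slots)))
            rng.shuffle(page)
            order.extend(page)
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    # Link each slot to its successor in visiting order, closing the cycle.
    next_index = [0] * n_slots
    for here, there in zip(order, order[1:] + order[:1]):
        next_index[here] = there
    return next_index


def walk(next_index, start=0):
    """Follow the chain once around; each hop depends on the previous load."""
    visited = []
    i = start
    for _ in range(len(next_index)):
        visited.append(i)
        i = next_index[i]
    return visited


chain = build_chain(16, "in_page_random", page_slots=4)
print(walk(chain))
```

In a real test the walk would run on the device over a buffer much larger than the caches, and the elapsed time divided by the number of hops gives the per-access latency for that pattern.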

Results Interpretation (Bandwidth): Higher values (MPix/s, MB/s, etc.) mean better performance.

Results Interpretation (Latency): Lower values (ns, clocks, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
Memory Configuration 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 2GB DDR3 (shared with CPU) 3GB DDR2 (shared with CPU) Modern phones and tablets now ship with 3GB – close to the 32-bit address limit. While all SoCs support unified memory, none seems to support “zero copy” or HSA, which has recently made it to OpenCL on the desktop with version 2.0. Future SoCs will fix this issue and provide a global virtual memory address space that CPU & GPU can share.
GPGPU Memory Bandwidth Internal Memory Bandwidth (MB/s) 4528 9457 [+2x] 8383 4751 1436 Qualcomm’s SoC manages to extract almost 2x more bandwidth compared to Samsung’s SoC – which may explain some of the performance issues we saw when processing large amounts of data. Adreno has almost 10GB/s bandwidth to play with, similar to single-channel desktop/laptops!
GPGPU Memory Bandwidth Upload Bandwidth (MB/s) 2095 3558 [+69%] 3294 2591 601 Adreno wins again with 70% more upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (MB/s) 2091 3689 [+76%] 2990 2412 691 Again Adreno has over 70% more download bandwidth – it is no wonder it did so well in the compute tests. Mali will have to improve markedly to match.
GPGPU Memory Latency Global Memory Latency – In Page Random Access (ns) 199.3 [-17%] 239.6 388.2 658.8 625.8 It starts well for Mali T760, with ~17% lower response time than Adreno 420 and over 3x lower than its old T628 brother – which no doubt helps the performance of many algorithms that access global memory randomly. All modern SoCs (805, 5433, 801) show just how much improvement was made in the memory prefetchers in the last few years.
GPGPU Memory Latency Global Memory Latency – Full Random Access (ns) 346.5 [-30%] 493.2 500.2 933.7 815.2 Full random access is tough on all GPUs, but here Mali T760 manages to be 30% faster than Adreno 420, which has not improved over 330 (same memory controller in 800 series).
GPGPU Memory Latency Global Memory Latency – Sequential Access (ns) 147.2 106.6 [-27%] 98.2 116.2 280.6 With sequential accesses – we finally see Adreno 420 (but also the older 300 series) show their prowess being 27% faster. Qualcomm’s memory prefetchers seem to be doing their job here.
GPGPU Memory Latency Constant Memory (ns) 263.1 70.7 [-1/3.75x] 74.5 103 343 Here Adreno 420’s constant memory has almost 4x lower latency than Mali (thus 1/4 response time) – which may be a clue as to why it is so much faster. It seems that constant memory on the Mali is not separately cached but is just normal global memory.
GPGPU Memory Latency Shared Memory (ns) 301 30.1 [-1/10x] 47 83 329 Shared memory is even more important as it is used to share data between threads – lack of it reduces the work-group sizes that can be used. Here we see Adreno 420’s shared memory having 10x lower latency than Mali (thus 1/10x response time) – no wonder Mali T760 is so much slower in complex workloads that make extensive use of shared memory. It appears that Mali’s shared memory is not *dedicated* but just normal global memory.

Memory testing seems to reveal Mali T760’s problem: its bandwidth is much lower than Adreno’s while the latencies of its key memories (shared, constant) are far higher. It is a wonder it performs as well as it does if the numbers are to be believed, but since Mali T628 scores similarly there is no reason to doubt them.

Adreno 420 has 2x higher internal bandwidth and about 70% more upload/download bandwidth; since neither GPU supports HSA and thus “zero copy”, its advantage in absolute time grows with the size of the memory blocks used. Here, Qualcomm’s completely in-house SoC design (CPU, GPU, memory controller) pays dividends.
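
As a rough illustration of what the transfer figures mean without “zero copy”, the Python sketch below estimates the time to upload a block and download it again using the measured bandwidths from the table above. It is a simple linear model that ignores fixed per-transfer overheads; the function name and block sizes are ours.

```python
# Upload/download bandwidth (MB/s) from the memory bandwidth table above.
BANDWIDTH_MB_S = {
    "Mali T760": {"upload": 2095, "download": 2091},
    "Adreno 420": {"upload": 3558, "download": 3689},
}


def round_trip_ms(gpu, block_mb):
    """Time in ms to upload then download one block, ignoring fixed overheads."""
    bw = BANDWIDTH_MB_S[gpu]
    seconds = block_mb / bw["upload"] + block_mb / bw["download"]
    return seconds * 1000.0


# Hypothetical block sizes, the largest matching the T760's biggest allocation.
for block_mb in (1, 64, 512):
    mali = round_trip_ms("Mali T760", block_mb)
    adreno = round_trip_ms("Adreno 420", block_mb)
    print(f"{block_mb:>4} MB: Mali {mali:8.2f} ms, Adreno {adreno:8.2f} ms, "
          f"gap {mali - adreno:8.2f} ms")
```

In this model the ratio between the two stays constant, but the absolute time lost to copying grows linearly with block size, reaching roughly 0.2 s for a 512MB round trip.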

Mali T760’s global memory latency is lower, but neither constant nor (more crucially) shared memory seems to be treated differently, and thus both have similar latencies to global memory; common GPGPU optimisations are thus useless and any complex algorithm making extensive use of shared memory will be greatly bogged down. ARM should rethink their approach for the new (T800) Mali series.

Video Shader Performance

We are testing vectorised shader compute performance of the GPUs in OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Video Shader Benchmark Single-Float/FP32 OpenGL ES (Mpix/s) 49 170.5 [+3.5x] 114.4 60 33.6 124.7 Finally the K1 can play and does very well, but it cannot overtake Adreno 420, which also blows the Mali T760 out of the water by being about 3.5x faster. We can see just how much shader performance has improved in a few years.
Video Shader Benchmark Half-float/FP16 OpenGL ES (Mpix/s) 54 219.6 [+4x] 115 107.6 32.2 124.4 While Mali T760 finally supports FP16, it does not seem to do much good over FP32 (+10% faster) while Adreno 420 benefits greatly – thus increases its lead to being 4x faster. OpenGL is still not Mali’s forte.
Video Shader Benchmark Double-float/FP64 OpenGL ES (Mpix/s) 2.3 [1/21x] (emulated) 10.6 [+5x] [1/17x] (emulated) 9.4 [1/12x] (emulated) 4.0 [1/15x] (emulated) 2.1 [1/16x] (emulated) 26.0 [1/4.8x] (emulated) While Mali T760 does support FP64, the OpenGL extension is not yet supported/enabled – thus it is forced to run it emulated in FP32. This allows Adreno 420 to be 5x faster – though nVidia’s K1 takes the win.
Video Shader Benchmark Quad-float/FP128 OpenGL ES (Mpix/s) n/a n/a n/a n/a n/a n/a Emulating FP128 using FP32 is too much for our GPUs, we shall have to wait for the new generation of mobile GPUs.

Using OpenGL ES allows the K1 to play, but more specifically it shows Mali’s OpenGL prowess is lacking: Adreno 420 is between 3.5x and 5x faster, a big difference. FP16 support seems to make little difference, while FP64 support is missing in OpenGL and thus Mali cannot play its ace card. ARM has some OpenGL driver optimisation to do.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mali T760 is a big upgrade over its older T600 series though a lot of the details have not changed. However, it is not enough to beat its rival Adreno 400 series – with native FP64 performance (and thus FP128 emulated) being the only shining example. While its integer workload performance is competitive – floating-point performance in complex workloads (making extensive use of shared memory) is much lower. Even highly vectorised kernels that should help its VLIW design cannot close the gap.

It seems the SoC’s memory controller lets it down, and its non-dedicated shared and constant memory means high latencies slow it down. ARM should really implement dedicated shared memory in the next major version.

Elsewhere its OpenGL shader compute performance is even slower (about 1/4 of Adreno’s) with FP16 support not helping much and FP64 native support missing. This is a surprise considering its far more competitive OpenCL performance. Hopefully future drivers will address this, but considering the T600 performance has remained pretty much unchanged we’re not hopeful.

To see how the Exynos 5433 CPU fares, please see Exynos 5433 CPU (Cortex A57+A53) performance article!

Exynos 5433 64-bit Cortex A57/A53: Krait Killer?

What is Exynos?

“Exynos” is the family of mobile SoCs from Samsung; the CPU cores in the 5433 are standard ARM Cortex designs (A57 + A53) while the (GP)GPU core is ARM’s own “Mali”. This is unlike Qualcomm’s competing Snapdragon SoCs, which use Qualcomm’s own “Krait” CPU cores and “Adreno” GPU.

Snapdragon comes in various series, with series 800 (Prime) representing the top of the range and lower-numbered series (e.g. 600, 400, 200, etc.) representing lower performance. Within the same series, higher numbers (e.g. 805, 801, 800, etc.) represent a newer generation and generally better performance and more features.

The Snapdragon CPU cores are called “Krait” and are Qualcomm’s own design under ARM licence; they are not standard ARM Cortex cores. The latest Krait 400 series shares many features with the Cortex A15, though some features are similar to the older Cortex A8/A9.

In this article we test CPU core (Cortex A57+A53) performance; please see our other articles on:

Hardware Specifications

We are comparing the internal CPU cores of various modern SoCs in the latest phones and tablets.

SoC Specifications Samsung Exynos 5433 / Samsung Galaxy Note 4C Qualcomm Snapdragon 805 / Samsung Galaxy Note 4F Qualcomm Snapdragon 801 / Sony Xperia Z3 Qualcomm Snapdragon 600 / Samsung Galaxy S4 LTE Samsung Exynos 5420 / Samsung Note 10 – 2014 Edition Comments
CPU Arch / ARM Arch Cortex A57+A53 ARMv8-A Krait 450 (APQ8084) ARMv7-A Krait 400 (MSM8974-AC) ARMv7-A Krait 300 (MSM8960) ARMv7-A Cortex A15+A7 ARMv7-A While the Cortex A5x series are 64-bit, the OS of Note 4 runs in 32-bit mode, thus ARMv7 normal code. It is unclear whether there will ever be a 64-bit version for this phone.
Cores (CU) / Threads (SP) 4C + 4c / 8 threads simultaneously 4C / 4 threads 4C / 4 threads 4C / 4 threads 4C + 4c / 4 threads Except Exynos which is big.LITTLE and has 4 big and 4 little cores, all other CPUs are quad-core. However, the Exynos 5433 can actually run 8 threads at the same time vs. 4 threads for the other CPUs including the older 5420.
Speed (Min / Max / Turbo) (MHz) 400-1900 (400-1300 / 700-1900) 300-2650 300-2466 384-1890 250-1900 (500-1300 / 600-1900) We see Krait 400 pushing close to 3GHz while Cortex designs hover around 2GHz, thus relying on per-clock compute power rather than clock speed.
L0D / L0I Caches (kB) n/a 4x 4kB 4x 4kB 4x 4kB n/a All Kraits have very small L0 caches while Cortex is a more traditional design.
L1D / L1I Caches (kB) 2x 4x 32kB 4x 16kB 4x 16kB 4x 16kB 2x 4x 32kB Cortex has 2x larger L1 caches than Krait but supposedly a bit slower.
L2 Caches (MB) 2MB 2MB 2MB 2MB 2MB All designs have the same size L2 cache.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (Neon2, Neon, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Native Benchmarks Samsung Exynos 5433 / Cortex A57+A53 Qualcomm Snapdragon 805 / Krait 450 Qualcomm Snapdragon 801 / Krait 400 Qualcomm Snapdragon 600 / Krait 300 Samsung Exynos 5420 / Cortex A15+A7 Comments
Exynos 5433 CPU Arithmetic
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 17.07 17.24 [+1%] 14.7 10.3 14.5 Here both 5433 and 805 are neck-and-neck within 1% difference. Despite its much higher clock (+40%) the Krait 450 just keeps up with the latest Cortex A57.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 90 [+7%] 84 74 62 92 5433 flexes its FP muscles here, being 7% faster than 805 despite the much higher clock. While double-precision floating-point workloads are uncommon on mobile/tablets, its use is increasing as more complex apps are ported.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 162 [+3%] 157 136 108 73 With FP32 VFP code, the 5433 is only 3% faster.
Despite the 805’s much higher clock (+40%), both CPUs are within 1-7% of each other. Naturally the 5433 also supports ARMv8 64-bit but is forced to run in legacy ARMv7 mode.
Exynos 5433 CPU Multi-Media
CPU Multi-Media Vectorised SIMD Native Integer (Int32) Multi-Media (Mpix/s) 22.48 Neon [+45%] 15.5 Neon 12.4 Neon 10.7 Neon 15.1 Neon Krait never seemed to do very well with SIMD (Neon) code and here we see 5433 being 45% faster than 805, the largest we’ve seen so far. ARM has really improved SIMD performance in modern Cortex cores with A15 already handily beating Krait designs. Qualcomm needs to overhaul the SIMD units to remain competitive.
CPU Multi-Media Vectorised SIMD Native Long (Int64) Multi-Media (Mpix/s) 4.3 Neon [+67%] 2.57 Neon 2.19 Neon 1.86 Neon 2.47 Neon With 64-bit Neon workload we see 5433 pull ahead, 67% faster than the 805! For integer SIMD multi-media code, Cortex is the core to beat!
CPU Multi-Media Vectorised SIMD Native Quad-Int (Int128) Multi-Media (kpix/s) 932 [+27%] 730 681 520 577 With normal int64 code, the 5433 still leads but that lead falls to 27%. It would naturally do better in 64-bit mode if it were running a 64-bit OS.
CPU Multi-Media Vectorised SIMD Native Float/FP32 Multi-Media (Mpix/s) 20.2 Neon [+25%] 16.2 Neon 14.13 Neon 10.45 Neon 13.57 Neon Switching to floating-point Neon SIMD code, the 5433 is still 25% faster over 805.
CPU Multi-Media Vectorised SIMD Native Double/FP64 Multi-Media (Mpix/s) 7.59 [+31%] 5.78 4.6 3.89 4.16 Switching to FP64 VFP code (Neon only supports FP64 in ARMv8 mode), 5433 is still 31% faster than 805.
CPU Multi-Media Vectorised SIMD Native Quad-Float/FP128 Multi-Media (kpix/s) 301 [=] 295 257 184 190 In this heavy algorithm using FP64 to extend the mantissa for FP128, the 5433 finally slows down, just matching the 805. Cortex’s power is realised with SIMD code.
With highly-optimised Neon SIMD code, the Cortex A57 that powers 5433 makes mince-meat out of the 805’s Krait being between 25-67% faster despite the much lower clock speed. Qualcomm really needs to improve those SIMD units or risk being badly left behind. Naturally if the 5433 were running in ARMv8 64-bit mode the difference would be much higher.
Exynos 5433 CPU Crypto
CPU Crypto Benchmark Crypto SHA2-512 (MB/s) 118 Neon [+2.26x] 52 Neon 45 Neon 32 Neon 67 Neon Starting with this tough 64-bit Neon SIMD accelerated hashing algorithm, 5433 again flexes its SIMD muscles, beating the 805 by over 2x (2.26x faster). It shows just how much better modern Cortex cores are at executing SIMD code.
CPU Crypto Benchmark Crypto AES-256 (MB/s) 136 147 [+8%] 130 90 146 In this non-SIMD workload, the 805 manages to be 8% faster – a surprising result. While the 5433 does support AES HWA, that is only in ARMv8 mode.
CPU Crypto Benchmark Crypto SHA2-256 (MB/s) 332 Neon [+46%] 227 Neon 225 Neon 148 Neon 186 Neon Switching to 32-bit Neon SIMD, 5433 is on top again, beating the 805 by 46%. Again, Cortex A5x does support SHA HWA but only in ARMv8 mode.
CPU Crypto Benchmark Crypto AES-128 (MB/s) 216 [+30%] 166 145 109 165 Fewer rounds do seem to make a bit of a difference, with 5433 now winning by 30% over the 805.
CPU Crypto Benchmark Crypto SHA1 (MB/s) 362 Neon [+14%] 315 Neon 250 Neon 213 Neon 297 Neon SHA1 is the “lightest” compute workload and here 5433 is only 14% faster.
Again in SIMD Neon code the 5433 shows its power, beating the 805 by between 14% and 126%, similar to what we saw in the Mandelbrot (Multi-Media) tests. Naturally 5433 also supports both AES and SHA HWA but only in ARMv8 mode, which needs a 64-bit OS. Here x86 does better as all instruction sets are available in both x86 and x64, unlike ARM, which conveniently seems to forget about the 32-bit world.
Exynos 5433 CPU Financial
CPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11.79 [+42%] 8.29 5.36 5.4 6.12 Although this algorithm does not use SIMD, the 5433 still manages to handily beat the 805 by 42%.
CPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 6.11 [+47%] 4.14 2.83 3.24 3.28 Switching over to FP64 code, the 5433 still manages to be 47% faster – the 805 just cannot get a break!
CPU Finance Benchmark Binomial float/FP32 (kOPT/s) 1.26 2.03 [+61%] 1.76 1.29 1.98 Binomial uses thread shared data thus stresses the cache & memory system; here finally we see the 805 pull ahead by 61%, a big win considering past results.
CPU Finance Benchmark Binomial double/FP64 (kOPT/s) 1.26 1.85 [+46%] 1.71 1.53 2.41 Switching to FP64 code the 805 still wins but by just 46%. It seems this is the kind of algorithm it prefers.
CPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 2.51 [+2x] 1.26 1.42 1.04 1.14 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; the fortunes are reversed again as 5433 is now 2x (twice) as fast as the 805.
CPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1.87 [+2.08x] 0.897 0.711 0.883 1.23 And finally, FP64 code does not make any difference: again the 5433 is 2x as fast.
The financial tests generally favour the 5433 which is between 40-100% faster than the 805, except in the “tough” binomial test where the 805 is between 40-60% faster. Even in VFP code the Cortex A5X is the core to beat!
Exynos 5433 CPU Science
CPU Science Benchmark SGEMM (MFLOPS) float/FP32 3906 Neon [+9%] 3579 Neon 3644 Neon 2626 Neon 4889 Neon In this complex Neon SIMD workload we would expect the 5433 to lead, and it does but only by 9%. It seems again that memory accesses slow it down and some of the 8 threads may be starving.
CPU Science Benchmark DGEMM (MFLOPS) double/FP64 1454 [+2.05x] 707 797 547 531 Neon does not support FP64 thus all CPUs use VFP code; here 5433 shows its power being over 2x (twice) faster than the 805.
CPU Science Benchmark SFFT (GFLOPS) float/FP32 720 Neon 989 Neon [+37%] 919 Neon 620 Neon 708 Neon FFT also uses SIMD and thus Neon but stresses the memory sub-system more: as we saw in Binomial, the 805 is in the lead, here by 37%.
CPU Science Benchmark DFFT (GFLOPS) double/FP64 457 586 [+28%] 550 401 399 With FP64 VFP code, the 805 still leads by 28%. It seems the memory sub-system of the 5433 lets it down.
CPU Science Benchmark SNBODY (GFLOPS) float/FP32 758 Neon [+80%] 420 Neon 331 Neon 342 Neon 465 Neon N-Body simulation is SIMD heavy but has many memory accesses to shared data, but read-only – allows the 5433 to win again by 80%. It seems read-only data is not a problem, but read/modify/write is.
CPU Science Benchmark DNBODY (GFLOPS) double/FP64 339 [+46%] 232 183 145 199 With FP64 VFP code we see the 5433 still winning but by just 46%.
The results mirror what we saw in the Financial tests: whenever thread-shared memory is used that is read/modified/written – the 5433 slows down, no doubt the extra 4 threads don’t help matters and likely slow it down.
Exynos 5433 CPU Multi-Core
CPU Multi-Core Benchmark Inter-Core Bandwidth (MB/s) 3994 [+10%] (but ~500/core) 3599 (but ~899/core) 2950 (but ~737/core) 2133 (but ~533/core) 1349 (but ~337/core) One thing that Qualcomm does very well is memory performance, both CPU and GPU-wise. But here 5433 has 4 more cores which helps it muscle out its rival with 10% more bandwidth. But while it technically wins, the bandwidth per core is just ~500MB/s while 805 has ~899MB/s, almost 2x more bandwidth. We see how all these caches perform in the Snapdragon 805 Cache and Memory performance article.
CPU Multi-Core Benchmark Inter-Core Latency (ns) 287 121 [-57%] 118 162 128 Latency, however, is much higher – or in other words 805 is 57% faster. It will be interesting to see whether this is due to different core transfer (e.g. big-2-LITTLE) or even between the same type (big-2-big / LITTLE-2-LITTLE).

The 5433, with its modern Cortex A5X design as well as 8 threads, walks all over the 805 despite being clocked much lower; in SIMD (Neon) tests especially it is up to 2x (twice) as fast. Only in algorithms that make extensive use of shared thread data and read/modify/write it does the 805 catch a break and come out faster.

It will be interesting to see whether the extra 4 threads (aka little cores) just get in the way in these tests and put too much strain on the memory system; effectively it may be better to use just 4 threads (aka BIG cores). We will investigate this in a future article.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java, which is what Android and its apps run on. While key compute code will generally be native, the rest of the code will run on the JVM.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

JVM Benchmarks Samsung Exynos 5433 / Cortex A57+A53 Qualcomm Snapdragon 805 / Krait 450 Qualcomm Snapdragon 801 / Krait 400 Qualcomm Snapdragon 600 / Krait 300 Samsung Exynos 5420 / Cortex A15+A7 Comments
Exynos 5433 Java Arithmetic
Java Arithmetic Java Dhrystone (GIPS) 13.4 [+56%] 8.55 3.85 4.44 3.03 Unlike native Dhrystone (where we saw a minor delta), the Java version is undoubtedly faster on the 5433, by over 50%, a clear win. We expected the Krait to do better in integer workloads.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 88 [+76%] 50 37 30 22 For FP64 JVM code, the 5433 is now 76% faster. While there may not be many FP64 Java workloads, the performance is there if you need it.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 106 [+2.2x] 48 45 41 28 Switching to single-precision floating-point code, 5433 is even faster, over 2x the 805.
The 5433 advantage increases with every test, be it integer, floating-point – crushing the 805.
Exynos 5433 Java Vectorised
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 4668 [+48%] 3147 3803 2201 2675 While vectorised code would normally be native, there may be apps using normal Java code and here the 5433 is almost 50% faster than 805 which somehow ends up slower than its older 801 brother. We put this down to JVM/Android differences (5.0.1 vs. 5.0.2).
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 2803 [+46%] 1917 1141 1168 With 64-bit integer vectorised workload, we see a similar delta of 46% in 5433’s favour.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 6053 [+88%] 3224 2587 1806 Switching to single-precision (FP32) floating-point code, the delta increases again to 88% – 5433 is almost 2x as fast!
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 6003 [+86%] 3219 2626 1636 We see the same thing here, with the 5433 enjoying an 86% advantage.
Vectorised Java code performs similarly to non-vectorised Java in the previous test.

While native code showed some surprises, here the 5433 is the undisputed champion – beating the 805 in all tests by a wide margin of 50% to over 100% (2x as fast). For pure Java apps the 5433 should feel a lot faster.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is not really a surprise that the latest ARMv8 64-bit 8-core Cortex A57+A53 (albeit running in 32-bit ARMv7 mode) in Exynos 5433 would dominate the ageing Krait 400-series core in Snapdragon 805 – but the latter’s 40% higher clock could have thrown a few “wobblies”.

Unlike earlier big.LITTLE designs, all 8-cores can be used simultaneously – but this may actually present a problem when using static work allocators as the “big” cores may wait for the “LITTLE” cores to finish – in effect having 8 little cores. We will be exploring the differences in performance when using just the “big” cores, just the “LITTLE” cores or all in a future article.
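
The static-allocation problem can be seen with a toy model. The Python sketch below compares splitting work equally up front against letting idle cores pull work from a shared queue. The core speeds (a “big” core being 2x a “LITTLE” core), the item count and the function names are all hypothetical, chosen only to illustrate the effect; they are not measurements.

```python
import heapq


def static_makespan(n_items, speeds):
    """Split work equally up front; the slowest core finishes last."""
    per_core = n_items / len(speeds)
    return max(per_core / speed for speed in speeds)


def dynamic_makespan(n_items, speeds, chunk=1):
    """Idle cores pull fixed-size chunks from a shared queue."""
    # Min-heap of (time_when_free, core_id): the next free core takes work.
    free_at = [(0.0, core) for core in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = n_items
    finish = 0.0
    while remaining > 0:
        now, core = heapq.heappop(free_at)
        take = min(chunk, remaining)
        remaining -= take
        done = now + take / speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


# Hypothetical relative speeds: 4 big cores at 2.0, 4 LITTLE cores at 1.0.
big_little = [2.0] * 4 + [1.0] * 4
all_little = [1.0] * 8

print("static, big.LITTLE :", static_makespan(800, big_little))
print("static, 8x LITTLE  :", static_makespan(800, all_little))
print("dynamic, big.LITTLE:", dynamic_makespan(800, big_little))
```

With these assumed speeds the static split takes 100 time units, exactly the same as eight LITTLE cores would, while the dynamic queue finishes in 67. A benchmark or app that divides work evenly per thread would therefore see little benefit from the big cores.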

It is naturally a pity that the 5433 does not use a 64-bit Android version and thus benefit from all the ARMv8 improvements, not to mention new instruction sets like AES HWA, SHA HWA, FP64 Neon and so on. It seems that Samsung (like other vendors, we may add) may never actually release a 64-bit OS/ROM for it, and thus the 5433, like other Cortex A5x SoCs, is destined to run 32-bit for its whole life… Without 64-bit binary drivers there may not be a way for 3rd-party developers (modders?) to make a 64-bit OS either…

However, even under these circumstances the Exynos-powered Note 4 is the most powerful Note (CPU-wise), though the roles seem to be reversed when comparing the GPUs, as we saw in the previous article Exynos 5433 (Mali) GPGPU performance. Thus the decision as to which Note 4 to choose is more difficult: do you want CPU or GPU power? As lots of compute tasks are moving to GPGPU (even on tablets/phones), we would lean towards GPU prowess… Don’t forget to consider memory performance, which we’re investigating in the next article Exynos 5433 Cache and Memory performance.