ARM big.LITTLE: The trouble with heterogeneous multi-processing: when 4 are better than 8 (or when 8 is not always the “lucky” number)


What is big.LITTLE?

ARM’s big.LITTLE architecture made big waves at launch back in 2011; it was heralded as a great way to combine performance with power efficiency: a “big” (computationally powerful but power-hungry) core was to be coupled with a “LITTLE” (slow but low-power) core – and switched as needed. The way they are switched has changed as both the SoCs and the OS scheduler managing it all have evolved – but, as we shall see, not necessarily for the better.

The “poster child” for big.LITTLE has been Samsung’s Exynos family of SoCs as used in their extremely popular range of Galaxy phones and tablets (e.g. S4, S5; Note 3, 4; Tab Pro, S; etc.) so we shall use them to discuss – and benchmark – various big.LITTLE switching techniques.

Clustered Switching (CS)

CS was the first type of switching used: it arranged the big cores into a “big cluster” and the LITTLE cores into a “LITTLE cluster” – with whole clusters switched in/out as needed. The Exynos 5410 SoC (e.g. in the Galaxy S4) thus had 4 Cortex A15 cores in the big cluster and 4 Cortex A7 in the LITTLE cluster – 8 cores in total – but only 4 cores could be used at any one time:

big.Little CS
Working out when to switch clusters optimally was difficult – especially with users demanding “no lag” but also good battery life: you want quick ramping of speed on events (fast interaction) but also quick sleep. Switch too conservatively and you get “lag”; switch too often and you waste power.

On single-threaded workloads, one high-compute thread required the whole big cluster to be switched on – wasting power (the other 3 cores could theoretically have been parked/disabled, but the scheduler did not implement this) – or had to run slower on the LITTLE cluster.

But it was still “symmetric multi-processing” (SMP) – i.e. all active cores were the same (either all big or all LITTLE) – making parallel code “easy” with symmetric workloads for each core.

In-Kernel Switching (IKS)

IKS added granularity by pairing one big and one LITTLE core in each cluster – with the cores within each cluster switched in/out as needed. The Exynos 5420 SoC (e.g. in the Note 3) thus had 4 clusters, each made up of one Cortex A15 + one Cortex A7 core – 8 cores in total – but again only 4 cores could be used at any one time:

big.Little IKS
Working out when to switch cores in each cluster was thus easier depending on the compute workload on each cluster – some clusters may have the big core active while others may have the LITTLE core active.

While we no longer had “true SMP” – programmers could still assume it was SMP (i.e. all cores the same), with the scheduler hopefully switching the same cores within each cluster given a symmetrical workload on each cluster.

Heterogeneous Multi-Processing (HMP) aka Global Task Scheduling (GTS)

HMP is really a fancy name for “asymmetric multi-processing” (AMP) where not all the cores are the same and there are no more clusters (at least hardware wise) – all cores are visible. Exynos 5433 (e.g. Note 4) thus has 4 Cortex A57 + 4 Cortex A53 cores – 8 cores in total with all 8 on-line if needed.

It is also possible to have different numbers of big and LITTLE cores now, e.g. the Exynos 5260 had 2 big and 4 LITTLE cores, but normally the counts are the same:

big.Little HMP
The scheduler can thus allocate high-compute threads on the big cores and low-compute threads on the LITTLE cores or park/disable any core to save power and migrate threads to the other cores. Server OSes have done this for years (both allocating threads on different CPUs/NUMA nodes first for performance or grouping threads on a few CPUs to park/disable the rest to save power).

Unfortunately this pushes some decisions to the app (or algorithm implementation library) programmers: is it better to use all (8 here) threads – or 4 threads? Should the scheduling be left to the OS or the app/library? It is definitely a headache for programmers and won’t make SoC makers too happy as their latest chip won’t show the performance they were hoping for…

Is it better to use all or just the big cores? Is 4 better than 8?

While in some cultures “8” is a “lucky number” – just witness the proliferation of “octa-core” SoCs almost without reason (Apple’s iPhones are still dual-core, and slow they are not) – due to HMP we shall see that this is not always the case.

In x86-world, Hyper-Threading created similar issues: some computers/servers had HT disabled as workloads ran faster that way (e.g. games). Early OSes did not distribute threads optimally as the scheduler had no knowledge of HT (e.g. one core would be assigned 2 threads while other cores were free) – thus some apps/libraries (like Sandra itself) had to use “hard scheduling” on the “right” cores or use fewer threads than the (maximum) OS thread count.
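For the curious, “hard scheduling” boils down to a few affinity calls. Below is a minimal sketch using the Linux sched_setaffinity() API (also exposed to Android NDK code); the helper names are ours and the core numbering is an assumption – big cores are often, but not always, numbered 4-7, and this varies by SoC and kernel.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <assert.h>

/* Build a CPU mask covering cores [first, first+count).
   On many big.LITTLE SoCs the big cores are numbered 4-7 -
   an assumption, as the numbering varies by SoC/kernel. */
static cpu_set_t make_core_mask(int first, int count) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = first; i < first + count; i++)
        CPU_SET(i, &set);
    return set;
}

/* Pin the calling thread to the given cores ("hard affinity").
   Returns 0 on success; may fail if the kernel has those cores
   off-line or forbids the request. */
static int pin_current_thread(int first, int count) {
    cpu_set_t set = make_core_mask(first, count);
    return sched_setaffinity(0 /* current thread */, sizeof(set), &set);
}
```

Pinning a compute thread to the big cluster before launching work is essentially what our “hard affinity” testing later in this article does – though these helpers are illustrative, not Sandra’s actual code.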

For phones/tablets there is a further issue – SoC power/thermal limits: using all cores may use more power than just the big cores – causing some (or all) cores to be “throttled” and thus run slower.

For parallel algorithms where each thread has the same workload (symmetric or “static work allocator”), using all cores can mean that the big cores finish first and then wait for the LITTLE cores to finish – effectively ending up with the performance of all-LITTLE cores.

If the big cores are 2x or more faster – you were better off using just the big cores; even if the big cores are not that much faster – using all cores may (as detailed above) hit SoC power/thermal limits, causing all cores to run slower than if fewer cores were used.
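To make the break-even point concrete, here is a hypothetical timing model (our own illustration, not Sandra’s code) comparing a static equal-share split across all cores with a big-cores-only run; with 4+4 cores, big-only wins exactly when the big cores are 2x or more faster.

```c
#include <assert.h>

/* Hypothetical timing model for a static (equal-share) work split on a
   big.LITTLE SoC: each core gets work/(nb+nl), so the finish time is
   set by the slowest - i.e. LITTLE - cores. */
static double static_split_time(double work, int nb, int nl,
                                double speed_big, double speed_little) {
    double per_core  = work / (nb + nl);
    double t_big     = per_core / speed_big;
    double t_little  = per_core / speed_little;
    return t_big > t_little ? t_big : t_little;
}

/* Time when only the nb big cores are used. */
static double big_only_time(double work, int nb, double speed_big) {
    return work / (nb * speed_big);
}
```

With 4 big + 4 LITTLE cores and the big cores exactly 2x faster, both approaches finish at the same time; any faster and the static all-cores split loses, which matches the reasoning above.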

Parallel algorithms may thus need to switch to a more complex asymmetric (“dynamic work allocator”) scheme that monitors the performance of each thread (within a time-frame, as threads can be migrated to a different core by the scheduler) and assigns work accordingly.
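A minimal sketch of such a dynamic work allocator: threads pull fixed-size chunks from a shared atomic counter, so a fast (big) core that finishes a chunk early simply grabs the next one instead of idling while the LITTLE cores catch up. Names and chunk size are illustrative assumptions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

enum { CHUNK = 16 };  /* items claimed per grab - tune per workload */

struct work {
    _Atomic int next;   /* next unclaimed item index */
    int total;          /* total number of work items */
    _Atomic long done;  /* items processed (for verification) */
};

static void *worker(void *arg) {
    struct work *w = arg;
    for (;;) {
        int start = atomic_fetch_add(&w->next, CHUNK);
        if (start >= w->total) break;
        int end = start + CHUNK < w->total ? start + CHUNK : w->total;
        /* process items [start, end) - real code would do the work here */
        atomic_fetch_add(&w->done, (long)(end - start));
    }
    return 0;
}

/* Run nthreads workers over n items; returns the number processed. */
static long run_dynamic(int nthreads, int n) {
    struct work w = { .next = 0, .total = n, .done = 0 };
    pthread_t t[16];
    if (nthreads > 16) nthreads = 16;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], 0, worker, &w);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], 0);
    return w.done;
}
```

The design choice here is that no thread is ever assigned a fixed share up-front – throughput per core determines how much each core ends up doing, which is exactly what a heterogeneous big.LITTLE layout needs.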

Alternatively, implementations may have to make decisions based on hardware topology, or run quick tests/benchmarks in order to optimise themselves for best performance on the target hardware. Or hope for the best and leave the scheduler to work things out…

Hardware Specifications

Using “hard affinity” – aka scheduling threads on specific cores – we have tested the 3 scenarios (only big cores, only LITTLE cores, all cores) on the HMP Exynos 5433 SoC (in the Note 4) across all the benchmarks in Sandra in order to optimise each algorithm for best performance.

SoC Specifications Samsung Exynos 5433 / All 8 cores Samsung Exynos 5433 / 4 big cores Samsung Exynos 5433 / 4 LITTLE cores Comments
CPU Arch / ARM Arch 4x Cortex A57 + 4x Cortex A53 4x Cortex A57 4x Cortex A53 We shall see how much more powerful the A57 cores are compared to the A53.
Cores (CU) / Threads (SP) 4C + 4c / 8 threads simultaneously 4C / 4 threads 4c / 4 threads With HMP can have 8 threads active when all cores are in use. Whether we need them remains to be seen.
Core Features Combined Pipelined (15-stage), out-of-order, 3-way superscalar, 2-level branch prediction Pipelined (8-stage), in-order, 2-way superscalar, simple branch prediction The A57 is “built-for-speed” with a more complex architecture that can be up to 2x faster clock-for-clock.
Speed (Min / Max / Turbo) (MHz) 400-1900 700-1900 [+46%] 400-1300 The A57 are not only faster but can also scale up to 46% higher, pushing close to 2GHz.
L1D / L1I Caches (kB) 2x 4x 32kB 4x 32kB (2-way set) / 4x 48kB (3-way set) 4x 32kB Both core types have the same 32kB L1D caches; the A57’s L1I is larger at 48kB.
L2 Caches (MB) 2x 2MB 2MB (16-way set) 2MB All designs have the same size L2 cache.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (Neon2, Neon, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Native Benchmarks Samsung Exynos 5433 / All 8 cores Samsung Exynos 5433 / 4 big cores Samsung Exynos 5433 / 4 LITTLE cores Comments
ARM big.LITTLE CPU AA
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 17.07 [+46% vs 4C] 11.65 [+34% vs 4c] 8.64 Using all cores benefits greatly: they’re even 46% faster than just using the 4 big cores. The big cores are only 34% faster than the LITTLE cores, we were expecting more.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 90 [+42% vs 4C] 63 [+34% vs 4c] 47 We see similar stats in the floating-point test; another win for the 8 cores and small difference in the big/LITTLE performance.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 162 [+20% vs 4C] 135 [+64% vs 4c] 82 With FP32, we see the 8 cores being only 20% faster than just the 4 big cores, with the big cores 64% faster than the LITTLE ones.
Using all cores is undoubtedly faster than just the big cores – between 20-46% faster. The big cores themselves are 34-64% faster than the LITTLE cores – which should not be a surprise given they are clocked 46% higher; their advantage seems largely down to clock speed.
ARM big.LITTLE CPU MM
CPU Multi-Media Vectorised SIMD Native Integer (Int32) Multi-Media (Mpix/s) 22.48 Neon [+8.5% vs 4C] 20.72 Neon [+71% vs 4c] 12.12 Neon SIMD (Neon) code – integer workload – shows the power of the big cores – they are 71% faster than the LITTLE ones! But using all cores is still 8.5% faster – not a lot, but faster.
CPU Multi-Media Vectorised SIMD Native Long (Int64) Multi-Media (Mpix/s) 4.3 Neon [+43% vs 4C] 3.0 Neon [+27% vs 4c] 2.36 Neon The 64-bit Neon workload is hard on the cores; here the big cores are only 27% faster – we would expect a bit better. Thus using all cores is a whopping 43% faster.
CPU Multi-Media Vectorised SIMD Native Quad-Int (Int128) Multi-Media (kpix/s) 932 [+14% vs 4C] 813 [+75% vs 4c] 463 With normal int64 code, the big cores are again 75% faster than the LITTLE cores, similar to what we saw in Neon/Int32 – but using all cores is still 14% faster. So far using all cores is always faster.
CPU Multi-Media Vectorised SIMD Native Float/FP32 Multi-Media (Mpix/s) 20.2 Neon [+25% vs 4C] 16.13 Neon [+49% vs 4c] 10.79 Neon Switching to floating-point Neon SIMD code, the big cores show their power again – being ~50% faster than the LITTLE cores; but using all cores is 25% faster still!
CPU Multi-Media Vectorised SIMD Native Double/FP64 Multi-Media (Mpix/s) 7.59 [+45% vs 4C] 5.22 [+20% vs 4c] 4.03 Switching to FP64 VFP code (Neon only supports FP64 in ARMv8), the big cores are just 20% faster; thus using all cores is 45% faster still.
CPU Multi-Media Vectorised SIMD Native Quad-Float/FP128 Multi-Media (kpix/s) 301 [+12% vs 4C] 267 [+71% vs 4c] 156 In this heavy algorithm using FP64 to mantissa-extend FP128, the big cores show their power again – they are 71% faster – but using all cores is 12% faster still. Even with floating-point, all cores are still faster.
With highly-optimised Neon SIMD code, the big Cortex A57 cores are between 20-75% faster than the LITTLE A53 cores, whether in integer or floating-point loads. But using all 8 cores is still faster – by as little as 8% and up to 45%. Regardless, 8 cores are always faster.
ARM big.LITTLE CPU Crypto
CPU Crypto Benchmark Crypto SHA2-512 (MB/s) 118 Neon [+25% vs 4C] 94 Neon [+31% vs 4c] 72 Neon Starting with this tough 64-bit Neon SIMD accelerated hashing algorithm, we see the big cores 31% faster but all cores 25% faster still. Again, we would expect a little more from the big cores.
CPU Crypto Benchmark Crypto AES-256 (MB/s) 136 [-17% vs 4C] 163 [+2.4x vs 4c] 68 In this non-SIMD workload, we see the big cores being an incredible 2.4x faster than the LITTLE cores – and finally using all cores is 17% slower than just using the big cores. All cores do support AES HWA, but only in ARMv8 mode. In the x86 World we saw HT systems running AES HWA slower too, so this is not entirely a surprise.
CPU Crypto Benchmark Crypto SHA2-256 (MB/s) 332 Neon [+26% vs 4C] 262 Neon [+57% vs 4c] 167 Neon Switching to a 32-bit Neon SIMD workload, we see the big cores 57% faster (more than the clock difference) – but still, using all cores is 26% faster. Again, all cores support SHA HWA but only in ARMv8 mode.
CPU Crypto Benchmark Crypto AES-128 (MB/s) 216 [+1% vs 4C] 214 [+2.35x vs 4c] 91 Fewer rounds do make a bit of a difference: the big cores are again about 2.4x faster than the LITTLE cores, but using all cores is about the same. It seems the LITTLE cores just take bandwidth/power away from the big cores.
CPU Crypto Benchmark Crypto SHA1 (MB/s) 362 Neon [-1% vs 4C] 363 Neon [+44% vs 4c] 252 Neon SHA1 is the “lightest” compute workload but here the big cores are 44% faster than their LITTLE brothers – and using all cores is 1% slower.
In streaming algorithms we finally see the big cores making a difference – they are over 2x (twice) as fast as the LITTLE cores – so using all cores is slower. If AES HWA or SHA HWA were available we would likely see an even bigger difference. Just as in the x86 World with HT – hardware-accelerated streaming algorithms need fewer threads and more bandwidth – here the LITTLE cores just get in the way. Unfortunately the 64-bit OS for the SoC will have to wait… indefinitely…
ARM big.LITTLE CPU Finance
CPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11.79 [+45% vs 4C] 8.15 [+37% vs 4c] 5.93 As this algorithm does not use SIMD, the VFP power of the big cores makes only a 37% difference; thus using all cores is 45% faster still.
CPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 6.11 [+72% vs 4C] 3.55 [+14% vs 4c] 3.1 Switching over to FP64 code, the big cores have a tough time, they’re only 14% faster – while using all cores is a huge 72% faster.
CPU Finance Benchmark Binomial float/FP32 (kOPT/s) 1.26 [-25% vs 4C] 1.66 [+2.62x vs 4c] 0.63 Binomial uses thread-shared data and thus stresses the cache & memory system; here the 4 big cores are clearly faster than all 8 by 25%; the big cores themselves are 2.6x faster than the LITTLE ones.
CPU Finance Benchmark Binomial double/FP64 (kOPT/s) 1.26 [-18% vs 4C] 1.52 [+2.43x vs 4c] 0.625 Switching to FP64 code does not change things much, the big cores are 2.4x faster than the LITTLE ones and 18% faster than using all cores. More is not always better it seems.
CPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 2.51 [+74% vs 4C] 1.44 [+13% vs 4c] 1.28 Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches; somehow the big cores don’t work so well here, being only 13% faster than the LITTLE ones – thus using all cores is 74% faster. More cores make the difference here.
CPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1.87 [+67% vs 4C] 1.12 [+15% vs 4c] 0.97 And finally FP64 code does not make any difference, the big cores are just 15% faster with all cores being a huge 67% faster than them.
The financial tests generally favour the 8-core configuration, except the “tough” binomial test where the big cores are between 18-25% faster than using all the cores. Such read/modify/write algorithms cause bottlenecks in the cache/memory system where feeding 8 cores is a lot more difficult than just 4 and it shows.
ARM big.LITTLE CPU Science
CPU Science Benchmark SGEMM (MFLOPS) float/FP32 3906 Neon [-25% vs 4C] 5229 Neon [+2.64x vs 4c] 1979 Neon In this complex Neon SIMD workload the big cores show their power – they are 2.64x faster than the LITTLE ones – while using all 8 cores is actually 25% slower. It seems memory accesses slow things down and some of the 8 threads may be starving.
CPU Science Benchmark DGEMM (MFLOPS) double/FP64 1454 [+42% vs 4C] 1023 [+26% vs 4c] 810 Neon does not support FP64 in ARMv7 mode, so the VFPs do all the work: the big cores are now just 26% faster, with all the cores being 42% faster still. In ARMv8 mode it is likely the results would mirror the SGEMM ones.
CPU Science Benchmark SFFT (GFLOPS) float/FP32 720 Neon [-24% vs 4C] 943 Neon [+2.22x vs 4c] 423 Neon FFT also uses SIMD and thus Neon but stresses the memory sub-system more: we see similar results to SGEMM with the big cores being 2.22x faster and using all cores is 24% slower. Again it is likely that the memory sub-system cannot keep up with 8 cores.
CPU Science Benchmark DFFT (GFLOPS) double/FP64 457 [-22% vs 4C] 583 [+2.04x vs 4c] 285 With FP64 VFP code, we would expect a similar result to DGEMM – but instead we see a re-run of SFFT: big cores are over 2x faster than LITTLE cores with all 8 cores being 22% slower. Using Neon is not likely to change the results based on SFFT.
CPU Science Benchmark SNBODY (GFLOPS) float/FP32 758 Neon [+90% vs 4C] 398 Neon [+29% vs 4c] 308 Neon N-Body simulation is SIMD heavy and has many memory accesses to shared data, but read-only – so here using all 8 cores is 90% faster than just the 4 big cores that are just 29% faster than LITTLE cores. It is possible some throttling is happening here though.
CPU Science Benchmark DNBODY (GFLOPS) double/FP64 339 [+91% vs 4C] 177 [+39% vs 4c] 127 With FP64 VFP code we see similar results: using all cores is much faster (91%) while the big cores are 39% faster than the LITTLE ones.
With complex SIMD (Neon) FP32 code we see the power of the 4 big cores – they are 2-2.6x faster than the LITTLE ones; using all 8 cores is actually slower, with the LITTLE cores just sucking bandwidth for nothing. Clearly more is not always better.
ARM big.LITTLE CPU MCore
CPU Multi-Core Benchmark Inter-Core Bandwidth (MB/s) 3994 (but ~500/core) 2254 (but ~563/core)
CPU Multi-Core Benchmark Inter-Core Latency (ns) 287 59
The bandwidth per core is similar whether big or LITTLE – thus all 8 cores have higher total/aggregate bandwidth. However, inter-core latency is much higher when all cores are used; threads that share data should therefore try to run on the same type of core – be that big or LITTLE.

Despite the lack of algorithm optimisations – to deal with HMP and thus “asymmetric workload allocation” – it is good to see that using all 8 cores is *generally* faster. In honesty we’re a bit surprised. It seems that the OS scheduler does pretty well all by itself and most workloads should perform pretty well and things can “only get better” 😉

True, heavy SIMD (Neon) workloads allow the big cores to show their power – they are over 2-2.6x faster (more than just clock difference at 46%) – and here using just the big cores is faster between 17-25%. Some of this may be due to the additional stress on the memory sub-system which now has to feed 8 cores not 4, with the bandwidth per core decreasing.

The problem would be bigger in 64-bit ARMv8 mode where FP64 code could finally use SIMD/Neon – while crypto algorithms (AES, SHA) are hardware accelerated and thus bandwidth would count more than compute power. Again this is similar to what we saw in x86 World with HT systems.

However, again, just relying on the scheduler to “wake” and migrate the threads onto the big cores may be enough – no further scheduling logic may be needed.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Despite our reservations – the Exynos 5433 performs better than expected – with all its 8 cores enabled/used, and with the LITTLE Cortex A53 cores punching way above their weight. The big Cortex A57 cores do show their power, especially in SIMD Neon code where they are far faster (2-2.6x). Unfortunately in ARMv7 32-bit mode they are a bit hamstrung – unable to use FP64 Neon code as well as crypto (AES, SHA) hardware acceleration – a pity. Here a 64-bit ARMv8 Android version would likely be much faster, but it does not look like it will happen.

ARM has assumed they don’t need “legacy” ARMv7 code and everybody would just move to ARMv8: thus none of the new instructions are available in ARMv7 mode. Unlike Apple, Android makers don’t seem to be rushing to adopt a 64-bit Android, and while simple Java-only apps would naturally not care – the huge number of apps using native code (through the NDK) just won’t run (similar to other platforms like x86) – making compatibility a nightmare.

Apps/libraries that use dynamic/asynchronous workload allocators are likely to perform even better, but even legacy/simple parallel versions (using static/fixed workload allocators) work just fine.

The proviso is that heavy SIMD/Neon compute algorithms (e.g. Binomial, S/DGEMM, S/DFFT, etc.) use only 4 threads – hopefully migrated by the scheduler to the big cores – and thus perform better. Unfortunately there don’t seem to be APIs to determine whether the CPU is HMP – thus detection is difficult but not impossible. Using “hard affinity” aka scheduling threads on the “right” cores does not seem to be needed – unlike in x86-World where early OSes did not deal with HT properly – it seems the ARM-World has learned a few things.
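One possible detection heuristic – our own assumption, not an official API – is to compare each core’s maximum frequency as reported by the standard Linux cpufreq sysfs interface; more than one “speed class” suggests an HMP design (the paths and helper names below are illustrative):

```c
#include <stdio.h>
#include <assert.h>

/* Read a core's maximum frequency (kHz) from the standard Linux
   cpufreq sysfs location; returns -1 if unavailable. */
static long core_max_freq(int cpu) {
    char path[128];
    long khz = -1;
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &khz) != 1) khz = -1;
        fclose(f);
    }
    return khz;
}

/* Pure classification step: given per-core max frequencies, count the
   distinct "speed classes"; more than one suggests HMP (e.g. 1900000 kHz
   A57s next to 1300000 kHz A53s on the 5433 - illustrative values). */
static int count_speed_classes(const long *freq, int n) {
    int classes = 0;
    for (int i = 0; i < n; i++) {
        int seen = 0;
        for (int j = 0; j < i; j++)
            if (freq[j] == freq[i]) seen = 1;
        if (!seen) classes++;
    }
    return classes;
}
```

An app/library could read the frequencies once at start-up and, if two speed classes are found, cap its heavy SIMD thread count at the size of the fastest class.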

In the end, whatever the OS – the Cortex A5x cores are performing well and perhaps one day will get the chance to perform even better, but don’t bet on it 😉

Mali T760 GPGPU (Exynos 5433 SoC): FP64 Champion – Adreno Killer?


What is Mali? What is Exynos?

“Mali” is the name of ARM’s own “standard” GPU cores that complement the standard CPU cores (Cortex). Many ARM CPU licensees integrate GPU cores from other vendors in their SoCs (e.g. Imagination, Vivante, Qualcomm’s Adreno) rather than the default Mali.

Mali Series 700 is the 3rd-generation “Midgard” core that complements ARM’s 64-bit ARMv8 Cortex A5x designs and is thus used in the very latest phones/tablets; it has been updated to support new technologies like OpenCL, OpenGL ES and DirectX.

“Exynos” is the name of Samsung’s line of SoCs, used in Samsung’s own phones/tablets/TVs. Series 5 is the 5th-generation SoC, generally using ARM’s “big.LITTLE” architecture of “LITTLE” cores for low power and “big” cores for performance. The 5433 is the 1st 64-bit SoC from Samsung supporting AArch64 aka ARMv8 – but running in “legacy” 32-bit ARMv7 mode.

In this article we test (GP)GPU graphics unit performance; please see our other articles for the rest of the SoC.

Hardware Specifications

We are comparing the internal GPUs of various modern phones and tablets that support GPGPU.

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Type / Micro-Arch VLIW4 (Midgard 3rd gen) VLIW5 VLIW5 VLIW5 VLIW4 (Midgard 2nd gen) Scalar (Kepler) All except K1 are VLIW and thus work best with vectorised data; some compilers are very good at vectorising simple code (e.g. by executing multiple data items simultaneously), but the programmer can generally do a better job of extracting parallelism.
Core Speed (MHz) estimated 600 600 578 400 533 ? Core speeds are comparable, with the latest devices not pushing the clocks too high but instead improving the cores.
OpenGL ES Support 3.1 3.1 3.0 3.0 3.0 (should support 3.1) 3.1 Mali T7xx adds official support for OpenGL ES 3.1 just like the other modern GPU designs: Adreno 400 and K1. While Mali T6xx should also support 3.1, the drivers have not been updated for this “legacy” device.
OpenCL Support 1.2 (full) 1.2 (full) 1.1 1.1 1.1 (should support full) Not for Android, supports CUDA Mali T7xx adds support for OpenCL 1.2 “full profile” just like Adreno 420 – both supporting all the desktop features of OpenCL – thus any kernels developed for desktop/mobile GPUs can run pretty much unchanged.
CU / SP Units 8 / 256 4 / 128 4 / 128 4 / 64 8 / 64 1 / 192 Mali T760 has 2x the CU of T628 but they should also be far more powerful. Adreno 420 only relies on more powerful CUs over the 330/320; nVidia uses only 1 SMX/CU but more SPs.
Global Memory (MB) 2048 of 3072 1400 of 3072 1400 of 3072 960 of 2048 1024 of 3072 n/a Modern phones with 3GB memory seem to allow about 50% to be allocated through OpenCL. Mali does generally seem to allow more, typically 66%.
Largest Memory Block (MB) 512 of 2048 347 of 1400 347 of 1400 227 of 960 694 of 1024 n/a The maximum block size seems to be about 25% of total memory, but Mali’s driver allows as much as 50%.
Constant Memory (kB) 64 64 4 4 64 n/a Mali T600 was already fine here, with Adreno 400 needing to catch up to the rest. Previously, constant data would have needed to be kept in normal global memory due to the small constant memory size.
Shared Memory (kB) 32 32 8 8 32 n/a Again Mali T600 was fine already – with Adreno 400 finally matching the rest.
Max. Workgroup Size 256 x 256 x 256 1024 x 1024 x 1024 512 x 512 x 512 256 x 256 x 256 256 x 256 x 256 n/a Surprisingly the work-group size remains at 256 for Mali T700/T600, with Adreno 400 pushing all the way to 1024. That does not necessarily mean it is the optimum size.
Cache (Reported) kB 256 128 32 32 n/a n/a Here Mali T760 overtakes them all with a 256kB L2 cache, 2x bigger than Adreno 400 and older Mali T600.
FP16 / FP64 Support Yes / Yes Yes / No Yes / No Yes / No No No Here we have the first mobile GPU with native FP64! If you have double-precision floating-point workloads then stop reading now and get a SoC with the Mali T700 series.
Byte/Integer Width 16 / 4 1 / 1 1 / 1 1 / 1 16 / 4 n/a Adreno prefers non-vectorised integer data even though it is VLIW5; only Mali prefers vectorised data (vec4) similar to the old ATI/AMD pre-GCN hardware. At least all our vectorisations are not in vain 😉
Float/Double Width 4 / 2 1 / n/a 1 / n/a 1 / n/a 4 / n/a n/a As before, Adreno prefers non-vectorised while Mali vectorised data. As Mali T760 supports FP64, it also wants vectorised double floating-point data.

GPGPU Compute Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
5433 GPGPU Arithmetic
GPGPU Arithmetic Benchmark Half-float/FP16 Vectorised OpenCL (Mpix/s) 199.9 184.4 73.4 20.2
GPGPU Arithmetic Benchmark Single-Float/FP32 Vectorised OpenCL (Mpix/s) 105.9 182 [+71%] 114.8 49.1 20.2 Adreno 420 manages to beat Mali T760 by a good ~70% even though we use a highly vectorised kernel – not what we expected. Still, this is 5x faster than the old Mali T628, showing just how much Mali has improved since its last version – but not enough!
GPGPU Arithmetic Benchmark Double-float/FP64 Vectorised OpenCL (Mpix/s) 30.6 [+3x] 10.1 (emulated) 8.4 (emulated) 3.4 (emulated) 0.518 (emulated) Here you see the power of native support: Mali T760 blows everything out of the water – it is 3x faster than the Adreno 420 and a crazy 60x (sixty times) faster than the old Mali T628! This is really the GPGPU to beat – nVidia must be regretting not adding FP64 to the K1.
GPGPU Arithmetic Benchmark Quad-float/FP128 Vectorised OpenCL (Mpix/s) 0.995 [+5x] (emulated using FP64) 0.197 (emulated) 0.056 (emulated) 0.053 (emulated) failed Emulating FP128 using FP64 gives Mali T760 a big advantage – now it is 5x faster than Adreno 400! There is no question – if you have high precision computation to do, Mali T700 is your GPGPU.
soc_5433_gp_crypt
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 136 145 [+6%] 96 70 85 T760 is just a bit (6%) slower than Adreno 420 here, but still good improvement (2x) over its older brother T628.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 200 131 94 89
soc_5433_gp_hash
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 321 [+2%] 314 141 106 89 In this integer compute-heavy workload, Mali T760 just edges Adreno 420 by 2% – pretty much a tie. Both GPGPUs are competitive in integer workloads as we’ve seen in the AES tests also.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 948 442 271 294
soc_5433_gp_fin
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 212.9 235.4 [+10%] 170.7 85 98.3 Black-Scholes is not compute heavy allowing many GPUs to do well, and here Adreno 420 is 10% faster than Mali T760.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 0.842 6.605 [+7x] 4.737 1.477 0.797 Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and Adreno 420 does not disappoint; however Mali T760 (just like the T628) is not very happy with our code, posting a pitiful score that is seven times slower.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 34 59.5 [+1.75x] 19.1 14.2 10.4 Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Adreno 420 is 1.75x (almost two times) faster. It could be that Mali stumbles at the shared memory operations which are key to both algorithms.
soc_5433_gp_sci
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 5.167 6.179 [+19%] 3.173 2.992 2.362 Adreno 420 continues its dominance here, being ~20% faster than Mali T760 but nowhere near the lead it had in Financial tests.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 1.902 5.470 [+2.87x] 3.535 2.146 1.914 FFT involves a lot of memory accesses but here Adreno 420 is almost 3x faster than Mali T760, a lead similar to what we saw in the complex Financial (Binomial/Monte-Carlo) tests.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 14.3 27.7 [+2x] 23.9 15.9 9.46 N-Body generally allows GPUs to “spread their wings” and here Adreno 420 does not disappoint – it is 2x faster than Mali T760.

It seems our early enthusiasm over FP64 native support was quickly extinguished: while Mali T760 naturally does well in FP64 tests – it cannot beat its rival Adreno (420) in other tests.

In single-precision floating-point (FP32) simple workloads, Adreno is only about 10-20% faster; however in complex workloads (Binomial, Monte-Carlo, FFT, GEMM) Adreno can be 2-7x (times) faster – a huge lead. It seems to be down to shared memory accesses rather than the VLIW design needing highly-vectorised kernels – which is what we are using.

Naturally in double-precision floating-point (FP64) workloads, Mali T760 flies – being 3-5x (times) faster, so if those are the kinds of workloads you require – it is the natural choice. However, such precision is uncommon on phones/tablets – even desktop/laptop GPGPUs have crippled FP64 performance.

In integer workloads, the two GPGPUs are competitive with a 3-5% difference either way.

The relatively small (256) workgroup size may also hamper performance with Adreno 420 able to keep more (1024) threads in flight – although the shared cache size is the same.

GPGPU Memory Performance

We are testing memory bandwidth performance of GPUs using OpenCL, including transfer (up/down) to/from system memory; we also measure the latencies of the various memory types (global, constant, shared, etc.) using different access patterns (in-page random access, sequential access, etc.).

Results Interpretation (Bandwidth): Higher values (MPix/s, MB/s, etc.) mean better performance.

Results Interpretation (Latency): Lower values (ns, clocks, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
Memory Configuration 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 2GB DDR3 (shared with CPU) 3GB DDR2 (shared with CPU) Modern phones and tablets now ship with 3GB – close to the 32-bit address limit. While all SoCs support unified memory, none seem to support “zero copy” or HSA, which has recently made it to OpenCL on the desktop with version 2.0. Future SoCs will fix this issue and provide a global virtual memory address space that CPU & GPU can share.
soc_5433_gp_mbw
GPGPU Memory Bandwidth Internal Memory Bandwidth (MB/s) 4528 9457 [+2x] 8383 4751 1436 Qualcomm’s SoC manages to extract almost 2x more bandwidth than Samsung’s SoC – which may explain some of the performance issues we saw when processing large amounts of data. Adreno has almost 10GB/s of bandwidth to play with, similar to single-channel desktops/laptops!
GPGPU Memory Bandwidth Upload Bandwidth (MB/s) 2095 3558 [+69%] 3294 2591 601 Adreno wins again with 70% more upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (MB/s) 2091 3689 [+76%] 2990 2412 691 Again Adreno has over 70% more download bandwidth – it is no wonder it did so well in the compute tests. Mali will have to improve markedly to match.
soc_5433_gp_mlat
GPGPU Memory Latency Global Memory Latency – In Page Random Access (ns) 199.3 [-17%] 239.6 388.2 658.8 625.8 It starts well for Mali T760, with ~20% lower response time than Adreno 420 and almost 3x faster than its old T628 brother – which no doubt helps the performance of many algorithms that access global memory randomly. All modern SoCs (805, 5433, 801) show just how much the memory prefetchers have improved in the last few years.
GPGPU Memory Latency Global Memory Latency – Full Random Access (ns) 346.5 [-30%] 493.2 500.2 933.7 815.2 Full random access is tough on all GPUs and here, but Mali T760 manages to be 30% faster than Adreno 420 which has not improved over 330 (same memory controller in 800 series).
GPGPU Memory Latency Global Memory Latency – Sequential Access (ns) 147.2 106.6 [-27%] 98.2 116.2 280.6 With sequential accesses – we finally see Adreno 420 (but also the older 300 series) show their prowess being 27% faster. Qualcomm’s memory prefetchers seem to be doing their job here.
GPGPU Memory Latency Constant Memory (ns) 263.1 70.7 [-1/3.75x] 74.5 103 343 Here Adreno 420’s constant memory has almost 4x lower latency than Mali’s (thus 1/4 the response time) – which may be a clue as to why it is so much faster. Constant memory does not appear to be cached on Mali at all – it seems to be treated as plain global memory.
GPGPU Memory Latency Shared Memory (ns) 301 30.1 [-1/10x] 47 83 329 Shared memory is even more important as it is used to share data between threads – lack of it reduces the work-group sizes that can be used. Here we see Adreno 420’s shared memory having 10x lower latency than Mali’s (thus 1/10 the response time) – no wonder Mali T760 is so much slower in complex workloads that make extensive use of shared memory. Shared memory on Mali is not *dedicated* but just normal global memory.

Memory testing seems to reveal Mali T760’s problem: its bandwidth is much lower than Adreno’s while the latencies of its key memories (shared, constant) are far higher. If the numbers are to be believed it is a wonder it performs as well as it does – and since the Mali T628 scores similarly, there is no reason to doubt them.

Adreno 420 has 2x higher internal bandwidth and over 70% more upload/download bandwidth – and since neither GPU supports HSA and thus “zero copy” – it will be much faster the bigger the memory blocks used. Here, Qualcomm designing the complete SoC (CPU, GPU, memory controller) pays dividends.

Mali T760’s global memory latency is lower, but neither constant nor (more crucially) shared memory seems to be treated differently and thus both have latencies similar to global memory; common GPGPU optimisations are thus useless and any complex algorithm making extensive use of shared memory will be greatly bogged down. ARM would do well to re-think this approach for the new (T800) Mali series.
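To see why dedicated shared memory matters, consider how tiling – the classic GPGPU optimisation – cuts global-memory traffic. A hypothetical back-of-the-envelope model for an n×n matrix multiply (illustrative arithmetic only, not our benchmark code; `n` and `tile` are assumed values):

```python
def global_loads(n, tile=None):
    """Global-memory loads for an n x n matrix multiply.
    Naive: every output element reads a full row of A and column of B,
    i.e. 2*n global loads per element. Tiled: a work-group of tile x tile
    threads stages tile-sized blocks of A and B in shared memory, so each
    matrix element is fetched from global memory only n/tile times,
    i.e. 2*n/tile loads per output element."""
    per_element = 2 * n if tile is None else 2 * n // tile
    return n * n * per_element

naive = global_loads(1024)           # 2 * 1024^3 global loads
tiled = global_loads(1024, tile=16)  # 16x fewer global loads
```

When “shared” memory is really just global memory, as Mali’s latencies suggest, the staged copies add traffic rather than removing it – consistent with the T760’s poor showing in shared-memory-heavy kernels.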

Video Shader Performance

We are testing vectorised shader compute performance of the GPUs in OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
soc_5433_gp_vid_aa
Video Shader Benchmark Single-Float/FP32 OpenGL ES (Mpix/s) 49 170.5 [+3.5x] 114.4 60 33.6 124.7 Finally the K1 can play and does very well, but cannot overtake Adreno 420 – which also blows the Mali T760 out of the water, being over 3.5x faster. We can see just how much shader performance has improved in a few years.
Video Shader Benchmark Half-float/FP16 OpenGL ES (Mpix/s) 54 219.6 [+4x] 115 107.6 32.2 124.4 While Mali T760 finally supports FP16, it does not seem to do much good over FP32 (+10% faster) while Adreno 420 benefits greatly – thus increases its lead to being 4x faster. OpenGL is still not Mali’s forte.
Video Shader Benchmark Double-float/FP64 OpenGL ES (Mpix/s) 2.3 [1/21x] (emulated) 10.6 [+5x] [1/17x] (emulated) 9.4 [1/12x] (emulated) 4.0 [1/15x] (emulated) 2.1 [1/16x] (emulated) 26.0 [1/4.8x] (emulated) While Mali T760 does support FP64, the OpenGL extension is not yet supported/enabled – thus it is forced to run it emulated in FP32. This allows Adreno 420 to be 5x faster – though nVidia’s K1 takes the win.
Video Shader Benchmark Quad-float/FP128 OpenGL ES (Mpix/s) n/a n/a n/a n/a n/a n/a Emulating FP128 using FP32 is too much for current GPUs; we shall have to wait for the new generation of mobile GPUs.

Using OpenGL ES allows the K1 to play, but more importantly it shows that Mali’s OpenGL prowess is lacking – Adreno 420 is between 3.5-5x faster – a big difference. FP16 support seems to make little difference, while FP64 support is missing in OpenGL, thus Mali cannot play its Ace card. ARM has some OpenGL driver optimisation to make.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mali T760 is a big upgrade over its older T600 series though a lot of the details have not changed. However, it is not enough to beat its rival Adreno 400 series – with native FP64 performance (and thus FP128 emulated) being the only shining example. While its integer workload performance is competitive – floating-point performance in complex workloads (making extensive use of shared memory) is much lower. Even highly vectorised kernels that should help its VLIW design cannot close the gap.

It seems the SoC’s memory controller lets it down, and its non-dedicated shared and constant memory means high latencies slow it down. ARM should really implement dedicated shared memory in the next major version.

Elsewhere its OpenGL shader compute performance is even slower (1/4x Adreno) with FP16 support not helping much and native FP64 support missing. This is a surprise considering its far more competitive OpenCL performance. Hopefully future drivers will address this – but considering that T600 performance has remained pretty much unchanged, we’re not hopeful.

To see how the Exynos 5433 CPU fares, please see Exynos 5433 CPU (Cortex A57+A53) performance article!

Exynos 5433 64-bit Cortex A57/A53: Krait Killer?

Samsung Logo

What is Exynos?

“Exynos” is the family of mobile SoCs from Samsung; unlike Qualcomm’s competing Snapdragon SoCs – whose modern versions use Qualcomm’s own “Krait” CPU cores and “Adreno” (GP)GPU cores – Exynos SoCs generally contain standard ARM Cortex CPU and ARM Mali GPU designs.

Within each family, higher numbers represent newer generations and generally better performance and more features: in the Exynos 5 series the 5433 succeeds the 5420 and 5410, while on the Snapdragon side series 800 (e.g. 805, 801, 800) tops the lower-numbered 600/400/200 series.

The Exynos 5433’s CPU cores are ARM’s Cortex A57 and A53 in a big.LITTLE arrangement. Qualcomm’s Krait cores, by contrast, are its own design under ARM licence – not standard Cortex cores – with the latest 400-series sharing many features with the Cortex A15, though some remain closer to the older Cortex A8/A9.

In this article we test CPU core (Cortex A57+A53) performance; please see our other articles for GPGPU and memory performance.

Hardware Specifications

We are comparing the internal CPU cores of various modern SoCs in the latest phones and tablets.

SoC Specifications Samsung Exynos 5433 / Samsung Galaxy Note 4C Qualcomm Snapdragon 805 / Samsung Galaxy Note 4F Qualcomm Snapdragon 801 / Sony Xperia Z3 Qualcomm Snapdragon 600 / Samsung Galaxy S4 LTE Samsung Exynos 5420 / Samsung Galaxy Note 10.1 (2014 Edition) Comments
CPU Arch / ARM Arch Cortex A57+A53 ARMv8-A Krait 450 (APQ8084) ARMv7-A Krait 400 (MSM8974-AC) ARMv7-A Krait 300 (MSM8960) ARMv7-A Cortex A15+A7 ARMv7-A While the Cortex A5x cores are 64-bit, the OS on the Note 4 runs in 32-bit mode, thus executing normal ARMv7 code. It is unclear whether there will ever be a 64-bit version for this phone.
Cores (CU) / Threads (SP) 4C + 4c / 8 threads simultaneously 4C / 4 threads 4C / 4 threads 4C / 4 threads 4C + 4c / 4 threads Except the two Exynos, which are big.LITTLE with 4 big and 4 LITTLE cores, all other CPUs are quad-core. However, the Exynos 5433 can actually run 8 threads at the same time vs. 4 threads for the other CPUs – including the older 5420.
Speed (Min / Max / Turbo) (MHz) 400-1900 (400-1300 / 700-1900) 300-2650 300-2466 384-1890 250-1900 (500-1300 / 600-1900) We see Krait 450 pushing close to 3GHz while the Cortex designs hover around 2GHz, relying instead on per-clock compute power.
L0D / L0I Caches (kB) n/a 4x 4kB 4x 4kB 4x 4kB n/a All Kraits have very small L0 caches while Cortex is a more traditional design.
L1D / L1I Caches (kB) 2x 4x 32kB 4x 16kB 4x 16kB 4x 16kB 2x 4x 32kB Cortex has 2x larger L1 caches than Krait but supposedly a bit slower.
L2 Caches (MB) 2MB 2MB 2MB 2MB 2MB All designs have the same size L2 cache.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (Neon2, Neon, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Native Benchmarks Samsung Exynos 5433 / Cortex A57+A53 Qualcomm Snapdragon 805 / Krait 450 Qualcomm Snapdragon 801 / Krait 400 Qualcomm Snapdragon 600 / Krait 300 Samsung Exynos 5420 / Cortex A15+A7 Comments
Exynos 5433 CPU Arithmetic
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 17.07 17.24 [+1%] 14.7 10.3 14.5 Here both 5433 and 805 are neck-and-neck within 1% difference. Despite its much higher clock (+40%) the Krait 450 just keeps up with the latest Cortex A57.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 90 [+7%] 84 74 62 92 The 5433 flexes its FP muscles here, being 7% faster than the 805 despite the latter’s much higher clock. While double-precision floating-point workloads are uncommon on mobiles/tablets, their use is increasing as more complex apps are ported.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 162 [+3%] 157 136 108 73 With FP32 VFP code, the 5433 is only 3% faster.
Despite its very high clock (+40%), both CPUs are pretty much within 3-7% of each other. Naturally 5433 also supports ARMv8 64-bit but is forced to run in legacy ARMv7 mode.
Exynos 5433 CPU Multi-Media
CPU Multi-Media Vectorised SIMD Native Integer (Int32) Multi-Media (Mpix/s) 22.48 Neon [+45%] 15.5 Neon 12.4 Neon 10.7 Neon 15.1 Neon Krait never seemed to do very well with SIMD (Neon) code and here we see 5433 being 45% faster than 805, the largest we’ve seen so far. ARM has really improved SIMD performance in modern Cortex cores with A15 already handily beating Krait designs. Qualcomm needs to overhaul the SIMD units to remain competitive.
CPU Multi-Media Vectorised SIMD Native Long (Int64) Multi-Media (Mpix/s) 4.3 Neon [+67%] 2.57 Neon 2.19 Neon 1.86 Neon 2.47 Neon With 64-bit Neon workload we see 5433 pull ahead, 67% faster than the 805! For integer SIMD multi-media code, Cortex is the core to beat!
CPU Multi-Media Vectorised SIMD Native Quad-Int (Int128) Multi-Media (kpix/s) 932 [+27%] 730 681 520 577 With Int128 emulated via normal int64 code, the 5433 still leads but the lead falls to 27%. It would naturally do better if it were running a 64-bit OS.
CPU Multi-Media Vectorised SIMD Native Float/FP32 Multi-Media (Mpix/s) 20.2 Neon [+25%] 16.2 Neon 14.13 Neon 10.45 Neon 13.57 Neon Switching to floating-point Neon SIMD code, the 5433 is still 25% faster over 805.
CPU Multi-Media Vectorised SIMD Native Double/FP64 Multi-Media (Mpix/s) 7.59 [+31%] 5.78 4.6 3.89 4.16 Switching to FP64 VFP code (Neon only gains FP64 support in ARMv8), the 5433 is still 31% faster than the 805.
CPU Multi-Media Vectorised SIMD Native Quad-Float/FP128 Multi-Media (kpix/s) 301 [=] 295 257 184 190 In this heavy algorithm, which uses FP64 pairs to mantissa-extend to FP128, we finally see the 5433 slow down to just match the 805.
With highly-optimised Neon SIMD code, the Cortex A57 that powers 5433 makes mince-meat out of the 805’s Krait being between 25-67% faster despite the much lower clock speed. Qualcomm really needs to improve those SIMD units or risk being badly left behind. Naturally if the 5433 were running in ARMv8 64-bit mode the difference would be much higher.
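For reference, the FP128 test’s “mantissa-extend” approach represents each value as a pair of FP64 numbers (often called double-double). The core building block is an error-free addition; a minimal sketch of the idea, assuming round-to-nearest FP64 (not our actual benchmark kernel):

```python
def two_sum(a, b):
    """Error-free transform (Knuth): returns (s, e) with s = fl(a+b) and
    a + b == s + e exactly. The (hi, lo) pairs built from such sums form
    the 'double-double' value used to reach ~FP128 precision on FP64
    hardware."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) tuples;
    illustrative only."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi, lo = two_sum(s, e)
    return hi, lo
```

Every FP128 operation thus costs many dependent FP64 operations – which is why this test is so much heavier than the plain FP64 one and why throughput drops to kpix/s.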
Exynos 5433 CPU Crypto
CPU Crypto Benchmark Crypto SHA2-512 (MB/s) 118 Neon [+2.26x] 52 Neon 45 Neon 32 Neon 67 Neon Starting with this tough 64-bit Neon SIMD accelerated hashing algorithm, the 5433 again flexes its SIMD muscles, beating the 805 by over 2x (2.26x faster). It shows just how much better modern Cortex cores are at executing SIMD code.
CPU Crypto Benchmark Crypto AES-256 (MB/s) 136 147 [+8%] 130 90 146 In this non-SIMD workload, the 805 manages to be 8% faster – a surprising result. While the 5433 does support AES HWA, that is only available in ARMv8 mode.
CPU Crypto Benchmark Crypto SHA2-256 (MB/s) 332 Neon [+46%] 227 Neon 225 Neon 148 Neon 186 Neon Switching to a 32-bit Neon SIMD hash, the 5433 is on top again, beating the 805 by 46%. Again, Cortex A5x does support SHA HWA but only in ARMv8 mode.
CPU Crypto Benchmark Crypto AES-128 (GB/s) 216 [+30%] 166 145 109 165 Fewer rounds do seem to make a bit of a difference, with the 5433 now winning by 30% over the 805.
CPU Crypto Benchmark Crypto SHA1 (GB/s) 362 Neon [+14%] 315 Neon 250 Neon 213 Neon 297 Neon SHA1 is the “lightest” compute workload and here the 5433 is only 14% faster.
Again in SIMD Neon code the 5433 shows its power, beating the 805 by between 14-126% – similar to what we saw in the (Mandelbrot-based) multi-media tests. Naturally the 5433 also supports both AES and SHA HWA, but only in ARMv8 mode which needs a 64-bit OS. Here x86 does better, as all instruction sets are available in both x86 and x64 – unlike ARM, which conveniently seems to forget about the 32-bit world.
Exynos 5433 CPU Financial
CPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11.79 [+42%] 8.29 5.36 5.4 6.12 As this algorithm does not use SIMD, the 5433 still manages to handily beat the 805 by 42%.
CPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 6.11 [+47%] 4.14 2.83 3.24 3.28 Switching over to FP64 code, the 5433 still manages to be 47% faster – the 805 just cannot get a break!
CPU Finance Benchmark Binomial float/FP32 (kOPT/s) 1.26 2.03 [+61%] 1.76 1.29 1.98 Binomial uses thread shared data thus stresses the cache & memory system; here finally we see the 805 pull ahead by 61%, a big win considering past results.
CPU Finance Benchmark Binomial double/FP64 (kOPT/s) 1.26 1.85 [+46%] 1.71 1.53 2.41 Switching to FP64 code the 805 still wins but by just 46%. It seems this is the kind of algorithm it prefers.
CPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 2.51 [+2x] 1.26 1.42 1.04 1.14 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; the fortunes are reversed again as 5433 is now 2x (twice) as fast as the 805.
CPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1.87 [+2.08x] 0.897 0.711 0.883 1.23 And finally, FP64 code does not make any difference – again the 5433 is 2x as fast.
The financial tests generally favour the 5433 which is between 40-100% faster than the 805, except in the “tough” binomial test where the 805 is between 40-60% faster. Even in VFP code the Cortex A5X is the core to beat!
Exynos 5433 CPU Science
CPU Science Benchmark SGEMM (MFLOPS) float/FP32 3906 Neon [+9%] 3579 Neon 3644 Neon 2626 Neon 4889 Neon In this complex Neon SIMD workload we would expect the 5433 to lead, and it does but only by 9%. It seems again that memory accesses slow it down and some of the 8 threads may be starving.
CPU Science Benchmark DGEMM (MFLOPS) double/FP64 1454 [+2.05x] 707 797 547 531 Neon does not support FP64 thus all CPUs use VFP code; here 5433 shows its power being over 2x (twice) faster than the 805.
CPU Science Benchmark SFFT (GFLOPS) float/FP32 720 Neon 989 Neon [+37%] 919 Neon 620 Neon 708 Neon FFT also uses Neon SIMD but stresses the memory sub-system more: as we saw in Binomial, the 805 takes the lead, by 37%.
CPU Science Benchmark DFFT (GFLOPS) double/FP64 457 586 [+28%] 550 401 399 With FP64 VFP code, the 805 still leads by 28%. It seems the memory sub-system of the 5433 lets it down.
CPU Science Benchmark SNBODY (GFLOPS) float/FP32 758 Neon [+80%] 420 Neon 331 Neon 342 Neon 465 Neon N-Body simulation is SIMD-heavy with many accesses to shared data – but read-only, which allows the 5433 to win again by 80%. It seems read-only data is not a problem; read/modify/write is.
CPU Science Benchmark DNBODY (GFLOPS) double/FP64 339 [+46%] 232 183 145 199 With FP64 VFP code we see the 5433 still winning, but by just 46%.
The results mirror what we saw in the Financial tests: whenever thread-shared memory is used that is read/modified/written – the 5433 slows down, no doubt the extra 4 threads don’t help matters and likely slow it down.
Exynos 5433 CPU Multi-Core
CPU Multi-Core Benchmark Inter-Core Bandwidth (MB/s) 3994 [+10%] (but ~500/core) 3599 (but ~899/core) 2950 (but ~737/core) 2133 (but ~533/core) 1349 (but ~337/core) One thing that Qualcomm does very well is memory performance, both CPU and GPU-wise. But here 5433 has 4 more cores which helps it muscle out its rival with 10% more bandwidth. But while it technically wins, the bandwidth per core is just ~500MB/s while 805 has ~899MB/s, almost 2x more bandwidth. We see how all these caches perform in the Snapdragon 805 Cache and Memory performance article.
CPU Multi-Core Benchmark Inter-Core Latency (ns) 287 121 [-57%] 118 162 128 Latency, however, is much higher – or in other words 805 is 57% faster. It will be interesting to see whether this is due to different core transfer (e.g. big-2-LITTLE) or even between the same type (big-2-big / LITTLE-2-LITTLE).
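Inter-core latency benchmarks of this kind typically bounce a token between two threads and divide the elapsed time by the number of hand-offs. A rough Python sketch of the shape of such a test (the real benchmark pins threads to specific cores to distinguish big-to-big from big-to-LITTLE transfers; Python threads cannot set affinity, so this is illustrative only and its absolute numbers are meaningless):

```python
import threading
import time

def ping_pong(rounds=10000):
    """Bounce a token between two threads via events; each round trip is
    two cross-thread hand-offs, so latency ~ elapsed / (2 * rounds)."""
    ping, pong = threading.Event(), threading.Event()

    def responder():
        for _ in range(rounds):
            ping.wait()
            ping.clear()
            pong.set()

    t = threading.Thread(target=responder)
    t.start()
    t0 = time.perf_counter()
    for _ in range(rounds):
        ping.set()
        pong.wait()
        pong.clear()
    elapsed = time.perf_counter() - t0
    t.join()
    return elapsed / (2 * rounds) * 1e9  # ns per hand-off
```

On a big.LITTLE SoC, pinning the two threads to different core pairs is what would reveal whether the 5433’s high latency comes from big-to-LITTLE transfers crossing clusters.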

The 5433, with its modern Cortex A5X design as well as 8 threads, walks all over the 805 despite being clocked much lower – especially in SIMD (Neon) tests, where it is up to 2x (twice) as fast. Only in algorithms that make extensive use of shared thread data and read/modify/write it does the 805 catch a break and come out faster.

It will be interesting to see whether the extra 4 threads (i.e. the LITTLE cores) just get in the way in these tests and put too much strain on the memory system; effectively it may be better to use just 4 threads (i.e. the big cores). We will investigate this in a future article.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. the Java VM that Android and its apps run on. While key compute code will naturally be native, the rest of the code will run on the JVM.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

JVM Benchmarks Samsung Exynos 5433 / Cortex A57+A53 Qualcomm Snapdragon 805 / Krait 450 Qualcomm Snapdragon 801 / Krait 400 Qualcomm Snapdragon 600 / Krait 300 Samsung Exynos 5420 / Cortex A15+A7 Comments
Exynos 5433 Java Arithmetic
Java Arithmetic Java Dhrystone (GIPS) 13.4 [+56%] 8.55 3.85 4.44 3.03 Unlike native Dhrystone (where we saw a minor delta), the Java version is undoubtedly faster on the 5433 – by over 50%, a clear win. We expected the Krait to do better in integer workloads.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 88 [+76%] 50 37 30 22 For FP64 JVM code, the 5433 is now 76% faster. While there may not be many FP64 Java workloads, the performance is there if you need it.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 106 [+2.2x] 48 45 41 28 Switching to single-precision floating-point code, the 5433 is even faster – over 2x faster than the 805.
The 5433’s advantage increases with every test, be it integer or floating-point – crushing the 805.
Exynos 5433 Java Vectorised
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 4668 [+48%] 3147 3803 2201 2675 While vectorised code would normally be native, there may be apps using normal Java code and here the 5433 is almost 50% faster than 805 which somehow ends up slower than its older 801 brother. We put this down to JVM/Android differences (5.0.1 vs. 5.0.2).
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 2803 [+46%] 1917 1141 1168 With 64-bit integer vectorised workload, we see a similar delta of 46% in 5433’s favour.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 6053 [+88%] 3224 2587 1806 Switching to single-precision (FP32) floating-point code, the delta increases again to 88% – 5433 is almost 2x as fast!
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 6003 [+86%] 3219 2626 1636 We see the same thing here, with the 5433 enjoying a 86% advantage.
Vectorised Java code performs similarly to the non-vectorised Java in the previous tests.

While native code showed some surprises, here the 5433 is the undisputed champion – beating the 805 in all tests by a wide margin of 50% to over 100% (2x as fast). For pure Java apps the 5433 should feel a lot faster.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is not really a surprise that the latest ARMv8 64-bit 8-core Cortex A57+A53 (albeit running in 32-bit ARMv7 mode) in Exynos 5433 would dominate the ageing Krait 400-series core in Snapdragon 805 – but the latter’s 40% higher clock could have thrown a few “wobblies”.

Unlike earlier big.LITTLE designs, all 8-cores can be used simultaneously – but this may actually present a problem when using static work allocators as the “big” cores may wait for the “LITTLE” cores to finish – in effect having 8 little cores. We will be exploring the differences in performance when using just the “big” cores, just the “LITTLE” cores or all in a future article.
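The static-allocator problem can be illustrated with a toy scheduling model (hypothetical core speeds – here the big cores are assumed to be 2x faster than the LITTLE ones; the function and numbers are illustrative, not measurements):

```python
import heapq

def makespan(speeds, items, static=True):
    """Time to finish `items` equal work units on cores with relative
    `speeds`. Static: items are split evenly up front, so the run ends
    when the slowest (LITTLE) core drains its share. Dynamic: cores pull
    work from a shared queue, so faster cores simply process more items."""
    if static:
        share = items / len(speeds)
        return max(share / s for s in speeds)
    # dynamic: greedily hand the next item to the core that frees earliest
    heap = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(heap)
    for _ in range(items):
        t, i = heapq.heappop(heap)
        heapq.heappush(heap, (t + 1.0 / speeds[i], i))
    return max(t for t, _ in heap)

cores = [2.0] * 4 + [1.0] * 4  # 4 big @ 2x speed + 4 LITTLE @ 1x (assumed)
slow = makespan(cores, 800, static=True)   # 100 items each; LITTLE needs 100
fast = makespan(cores, 800, static=False)  # roughly 800/12, i.e. ~67
```

Under the static split the big cluster finishes its share halfway through and then sits idle waiting for the LITTLE cores; a dynamic (work-stealing) allocator recovers that time.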

It is naturally a pity that the 5433 does not run a 64-bit Android version and thus benefit from all the ARMv8 improvements, not to mention new instruction sets like AES HWA, SHA HWA, FP64 Neon and so on. It seems that Samsung (like other vendors, we may add) may never actually release a 64-bit OS/ROM for it – and thus the 5433, like other Cortex A5x SoCs, is destined to run 32-bit for its whole life… Without 64-bit binary drivers there may not be a way for third-party developers (modders?) to make a 64-bit OS either…

However, even under these circumstances the Exynos-powered Note 4 is the most powerful Note (CPU-wise) – though the roles seem to be reversed when comparing the GPUs, as we saw in the previous article Exynos 5433 (Mali) GPGPU performance. Thus the decision as to which Note 4 to choose is more difficult – do you want CPU or GPU power? As lots of compute tasks are moving to GPGPU (even on tablets/phones) – we would lean towards GPU prowess… Don’t forget to consider memory performance, which we’re investigating in the next article Exynos 5433 Cache and Memory performance.