AMD A4 “Mullins” APU CPU: Time does not stand still…

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kabini” (KB) SoC which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • Crypto: Random number generator.
  • Security: Platform Security Processor (PSP) included in the SoC (ARM based).

In this article we test CPU core performance; GPGPU (Radeon R4) and cache & memory performance are covered in separate articles.

Hardware Specifications

We are comparing Mullins with its predecessor (Kabini) as well as its competition from Intel.

APU Specifications Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) Comments
Cores (CU) / Threads (SP) 4C / 4T 2C / 4T 4C / 4T 4C / 4T We still have 4 cores and 4 threads just like Atom (old and new) – only Core M has 2 cores with HT – we shall see whether this makes a big difference.
Speed (Min / Max / Turbo) 480-1600-2400 (6x-20x-30x) 500-800-2000 (5x-8x-20x) 1000-1500 (10x-15x) 1000-1800-2400 (10x-18x-24x) Mullins is clocked a bit higher (1.8GHz vs. 1.5 – 20% faster) but also supports Turbo (up to 2.4GHz – up to 60% faster) which should give it a big advantage over old Kabini. Both Atom and Core M also depend on opportunistic Turbo for most of their performance. As Mullins/Kabini are 15W rated, they should be able to Turbo higher and for longer – at least in theory.
Power (TDP) 2.4W 4.5W 15W 15W [=] TDP remains the same at 15W which is a bit disappointing considering the new Atom is 2.4-4W rated: we’re talking about roughly 4-6x more power!
L1D / L1I Caches 4x 24kB 6-way / 4x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 2-way 4x 32kB 8-way / 4x 32kB 2-way No change in L1 caches, which pretty much match Atom’s; comparatively, Core M has half the caches.
L2 Caches 2x 1MB 16-way 4MB 16-way 2MB 16-way 2MB 16-way No change in L2 cache; here Core M has twice as much cache – same size as a normal i7 ULV. It’s a pity AMD was not able to increase the caches.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Haswell introduces AVX2 which allows 256-bit integer SIMD (AVX only allowed 128-bit) and FMA3 – finally bringing “fused-multiply-add” to Intel CPUs. We are now seeing CPUs getting as wide as (GP)GPUs – not far from AVX3 512-bit in “Phi”.
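
To put the register widths above into perspective, the short sketch below counts how many elements fit in one SIMD register. It only illustrates width; it says nothing about throughput, and the helper name `simd_lanes` is our own, not part of any benchmark.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# Register widths named in the text; element widths match the Int32/FP32
# and Int64/FP64 benchmark variants.
for label, width in (("128-bit", 128), ("256-bit (AVX2)", 256), ("512-bit", 512)):
    print(f"{label}: {simd_lanes(width, 32)} x 32-bit, {simd_lanes(width, 64)} x 64-bit")
```

A 256-bit integer unit therefore processes twice as many 32-bit values per instruction as a 128-bit one, which is the advantage Core M carries into the integer SIMD tests.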

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) Comments
Arithmetic Native
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 35.91 SSE4 [+20%] 31.84 AVX2 25.12 SSE4 28.77 SSE4 [+18%] Mullins like Kabini has no AVX2, but is still 18% faster than it (clocked 20% faster), unfortunately the new Atom manages to beat it. Not the best of starts.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 18.43 AVX [+13%] 21.56 AVX/FMA 13.55 AVX 16.2 AVX [+19%] Mullins has no FMA either, so is again 19% faster than Kabini – it shows the ALU and FPUs are unchanged. Again Atom manages to be faster.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 12.34 AVX [+23%] 13.49 AVX/FMA 8.44 AVX 10 AVX [+18%] With FP64 we see the same 18% difference – and Atom is still 20% faster.
We only see an 18-19% improvement in Mullins – in line with clock speed (+20%) with Turbo not doing much. The new CherryTrail Atom is thus 14-21% faster than it, not something you expect considering the hugely different TDP. Time does not stand still and Mullins is outclassed here.
SIMD Native
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 48.7 AVX [-20%] 70.8 AVX2 58.37 60.76 [+4%] Without AVX2, Mullins can only manage a paltry 4% improvement over Kabini – the only “silver lining” is that Atom is 20% slower than it – unlike what we saw before. Naturally Core M with AVX2 runs away with it.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 14.5 AVX [-38%] 24.5 AVX2 21.92 AVX 23.26 AVX [+6%] With a 64-bit integer workload, the improvement increases to 6%, less than clock difference – but thankfully Atom is much slower (38% slower) – naturally Core M is the winner.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 0.512 [+75%] 0.382 0.246 0.292 [+18%] This is a tough test using Long integers to emulate Int128, but here we see the full 18% improvement over Kabini – but now Atom is a huge 75% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 41.5 AVX [-4%] 61.3 FMA 36.91 AVX 43 AVX [+16%] In this floating-point AVX/FMA algorithm, Mullins returns to being 16% faster and a whisker faster than Atom (4%). With FMA, Core M is almost 50% faster still.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 15.9 AVX [-31%] 36.48 FMA 19.57 AVX 22.9 AVX [+17%] Switching to FP64 code, we see a 17% improvement for Mullins which leaves Atom about 30% behind.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 0.81 AVX [-37%] 1.69 FMA 1.27 AVX 1.27 AVX [=] In this heavy algorithm using FP64 to mantissa extend FP128, we see no improvement whatsoever – at least Atom is 37% slower; and yes, Core M is faster still.
Lack of AVX2/FMA and a Turbo that does not seem to engage make Mullins struggle to be more than 16-18% faster than Kabini – but thankfully it can beat its Atom rival, sometimes by a good 30%. All in all, it does better with SIMD code than we’ve seen elsewhere, though without any core changes it is showing its age…
Cryptography Native
BenchCrypt Crypto AES-256 (GB/s) 1.44 AES HWA [-55%] 2.59 AES HWA 3 AES HWA 3.14 AES HWA [+5%] All the CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – here Mullins is 5% faster than Kabini and 2x (twice) as fast as Atom; it even overtakes Core M with its dual-channel controller!
BenchCrypt Crypto AES-128 (GB/s) 2 AES HWA [-40%] ? AES HWA 3 AES HWA 3.3 AES HWA [+10%] What we saw with AES-256 was no fluke: fewer rounds do make some difference, Mullins is now 10% faster than Kabini and 65% faster than Atom!
BenchCrypt Crypto SHA2-256 (GB/s) 0.572 AVX [-24%] 0.93 AVX2 0.708 AVX 0.659 AVX [-7%] In this tough AVX compute test, Mullins is unexpectedly 7% slower than the old Kabini – but Atom is still slower. But with SHA HWA in the next Atom, AMD will quickly be at a big disadvantage…
BenchCrypt Crypto SHA1 (GB/s) 1.17 AVX [-23%] ? AVX2 1.42 AVX 1.34 AVX [-6%] With a less complex algorithm – we still see Mullins 6% slower than Kabini – and again Atom is slower.
BenchCrypt Crypto SHA2-512 (GB/s) ? AVX ? AVX2 0.511 AVX 0.477 AVX [-7%] By using 64-bit integers this is pretty much the most complex hashing algorithm and thus tough for all CPUs – and here we see Mullins 7% slower again.
Mullins misses both AVX2 and the forthcoming SHA HWA – but manages to extract more memory bandwidth and thus is 5-10% faster than Kabini in AES, and also much faster than Atom. Somehow it is 6-7% slower than Kabini in hashing, whatever the algorithm – but remains much faster than Atom.
Financial Analysis Native
BenchFinance Black-Scholes float/FP32 (MOPT/s) 14.15 [-26%] 21.88 17.55 18.87 [+7%] In this non-SIMD test we start with a 7% improvement over Kabini, good but less than we expected – but at least faster than Atom.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 11.66 [-27%] 17.6 13.43 15.82 [+18%] Switching to FP64 code, we see a good 18% improvement – and victory over Atom again.
BenchFinance Binomial float/FP32 (kOPT/s) 3.85 [-22%] 5.1 1.31 4.33 [+3.3x] Binomial uses thread shared data thus stresses the cache & memory system; Mullins has improved by a huge 3.3x (over three times) – and a great win over Atom (but not Core M). It seems the new memory improvements do help a lot.
BenchFinance Binomial double/FP64 (kOPT/s) 4.13 [-31%] 5.02 2.66 5.91 [+2.2x] With FP64 code, Mullins is “only” 2.2x (over two times) faster than Kabini – and this again gives it a big win over its Atom competition – as well as over Core M!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 2.67 [-29%] 3.73 3.23 3.71 [+15%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; here Mullins is just 15% faster but it’s enough to tie with Core M and leave Atom in the dust.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 2.43 [-30%] 3.0 2.67 3.43 [+28%] Switching to FP64 we see a big 28% improvement – Mullins manages to beat both Atom and Core M by a good measure.
Somehow Mullins managed to redeem itself – beating Atom in all tests and even Core M in some of the tests. Running financial tests on a Mullins tablet should work better than on an Atom or Core M one.
Scientific Analysis Native
BenchScience SGEMM (GFLOPS) float/FP32 12.42 AVX [-7%] 13.99 FMA 11.43 AVX 13.22 AVX [+16%] In this tough SIMD algorithm, Mullins sees a good 16% improvement, beating Atom and getting within a whisker of Core M – even without FMA.
BenchScience DGEMM (GFLOPS) double/FP64 6.09 AVX [-13%] 9.61 FMA 6.62 AVX 6.93 AVX [+5%] With FP64 SIMD code, Mullins is just 5% faster – but it’s enough to beat Atom – though not Core M. Still, a good improvement.
BenchScience SFFT (GFLOPS) float/FP32 5.17 AVX [+72%] 4.11 FMA 2.91 AVX 3 AVX [+3%] FFT also uses SIMD and thus AVX but stresses the memory sub-system more: here Mullins sees only a 3% improvement, not enough to beat Atom which is over 70% faster still. Mullins has its limits.
BenchScience DFFT (GFLOPS) double/FP64 2.83 AVX [+52%] 3.66 FMA 1.54 AVX 1.85 AVX [+20%] With FP64 code, Mullins improves by a large 20% – but again not enough to beat Atom which is now over 50% faster still.
BenchScience SNBODY (GFLOPS) float/FP32 1.58 AVX [-31%] 2.64 FMA 2.1 AVX 2.27 AVX [+8%] N-Body simulation is SIMD heavy but makes many memory accesses to shared data, so Mullins is just 8% faster than Kabini, but that is enough to beat Atom (by 43%). Unlike FFT, N-Body again agrees with Mullins/Kabini.
BenchScience DNBODY (GFLOPS) double/FP64 1.76 AVX [-30%] 3.71 FMA 2.09 AVX 2.51 AVX [+20%] With FP64 code Mullins improves again by 20% – but now more than enough to beat Atom (by 42%) – though not enough to beat Core M.
With highly optimised SIMD AVX code, Mullins sees a 5-20% improvement – which allows it to beat Atom in most tests – a good result from the rout we saw before.
Inter-Core Native
BenchMultiCore Inter-Core Bandwidth (GB/s) 1.69 [-52%] 8.5 3.0 3.47 [+15%] With unchanged L1/L2 caches Mullins relies on its higher rated speed – and 15% is a good improvement over Kabini. It’s got 2x (two times) the bandwidth Atom manages to muster – but way below Core M’s which has over 2.4x more still. We see how all these caches perform in the Cache & Memory AMD A4 “Mullins” performance article.
BenchMultiCore Inter-Core Latency (ns) 179 [1/3.5x] 76 66 31 [-33%] Latency, however, sees a massive 33% decrease, more than we’d expect – and surprisingly is way lower than Atom (1/3.5x) and even lower than Core M (1/2x).
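
The Quad-Float/FP128 test above is described as using FP64 to extend the mantissa. One common way to do this is “double-double” arithmetic, where a value is held as an unevaluated sum of two FP64 numbers. The sketch below shows only the addition step and is an assumption about the general technique, not the benchmark’s actual code; note that double-double gives roughly 106 bits of mantissa, not the 113 bits of true IEEE FP128.

```python
from typing import Tuple

DoubleDouble = Tuple[float, float]  # (hi, lo): value is hi + lo


def two_sum(a: float, b: float) -> DoubleDouble:
    """Knuth's TwoSum: s is the rounded sum, e the exact rounding error."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x: DoubleDouble, y: DoubleDouble) -> DoubleDouble:
    """Add two double-double values and renormalise the result."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)


tiny = 2.0 ** -80
plain = 1.0 + tiny                           # the small term is rounded away
extended = dd_add((1.0, 0.0), (tiny, 0.0))   # the small term survives in lo
print(plain == 1.0, extended)
```

Each extended addition costs several FP64 operations, which is why the FP128 scores in the table are so much lower than the FP64 ones.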

While it does not bring any new instruction sets (AVX2, FMA, SHA HWA) and its Turbo does not seem to engage, Mullins’s 20% clock improvement does show and brings a corresponding 5-19% increase in performance over Kabini in most tests.

Against Atom, the scores are all over the place, sometimes Atom (CherryTrail) is 20-70% faster, other times Mullins is 20-55% faster. If they were rated the same TDP-wise that would be a good result – but as Mullins is rated 15W vs. 2.6-4W that’s not really power efficient. Core M is invariably faster than either in just about all tests.
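
The bracketed percentages in the tables are plain relative differences. As a minimal sketch, the snippet below recomputes a few Mullins-vs-Kabini gains from the scores listed above and relates them to the 20% rated clock increase (1.8GHz vs. 1.5GHz); the helper name `delta_percent` is ours.

```python
def delta_percent(new: float, old: float) -> float:
    """Relative difference of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# Rated clocks from the specification table (MHz).
CLOCK_UPLIFT = delta_percent(1800, 1500)

# (Kabini, Mullins) scores copied from the native benchmark table.
scores = {
    "Whetstone FP32 (GFLOPS)": (13.55, 16.2),
    "Whetstone FP64 (GFLOPS)": (8.44, 10.0),
    "SGEMM FP32 (GFLOPS)": (11.43, 13.22),
}

for name, (kabini, mullins) in scores.items():
    gain = delta_percent(mullins, kabini)
    share = gain / CLOCK_UPLIFT
    print(f"{name}: +{gain:.1f}% ({share:.0%} of the clock uplift)")
```

Gains that sit at or just below the clock uplift are what one would expect from an unchanged core running faster, with Turbo contributing little.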

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 8.x/10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64 SP1, latest Intel drivers. .Net 4.5.x, Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) Comments
.Net Arithmetic
BenchDotNetAA .Net Dhrystone (GIPS) 3.85 [-33%] 2.95 4.0 5.78 [+44%] .Net CLR performance improves by a huge 44% – a great start, well above the 20% rated clock increase, and enough to leave Atom 33% behind.
BenchDotNetAA .Net Whetstone final/FP32 (GFLOPS) 9 [-3%] 10.8 12.49 9.24 [-23%] Floating-Point CLR performance takes a 23% hit over Kabini – but thankfully it is still a bit faster than Atom (by just 3%). Something in the new CLR does not agree with it.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 10 [-20%] 12.71 13.39 12.49 [-7%] FP64 CLR performance also sees a more modest 7% decrease in performance, but again it is still 20% faster than Atom.
With .Net we see a big variation from 23% lower to 44% higher performance than Kabini, but in all cases higher than Atom (between 3-33%). It is strange to see such a big variance, but the CLR changes may have something to do with it.
.Net Vectorised
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 11.34 [+7.8%] 11.25 7.12 10.52 [+47%] Just as we saw with Dhrystone, this integer workload sees a 47% improvement with Mullins – something in the CLR does agree with it – though not as much as Atom which is 8% faster still.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 6.36 [-12%] 5.62 1.78 7.2 [+4.04x] With 64-bit integer vectorised workload, we see a massive 4x (four times) improvement over Kabini – and 12% faster than Atom.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 2.6 [+25%] 4.14 2.01 2.08 [+3%] Switching to single-precision (FP32) floating-point code, we see only a minor 3% improvement – and here Atom is 25% faster still.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 6.25 [-12%] 9.36 5.97 7.1 [+19%] Switching to FP64 code, Mullins is 19% faster (in line with clock increase) and 12% faster than Atom. While it is unlikely that compute tasks are written in .Net rather than native code, small compute code does benefit.
Vectorised .Net improves between 3-47% over Kabini (except the 64-bit integer “fluke”), and is thus sometimes faster but sometimes slower than Atom.

We see a big variation here – unlike what we saw with native / SIMD code – likely due to CLR changes – but generally welcome. Against Atom we see an even larger variation – faster and slower, but overall competitive.
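
To quantify the variation described above, the sketch below takes the Kabini and Mullins .Net scores from the table and reports the worst ratio, the best ratio and a geometric mean. The geometric mean is our own choice of summary (it is a common way to average ratios) and is not a figure the benchmark itself reports.

```python
import math

# (Kabini, Mullins) scores copied from the .Net table above.
DOTNET_SCORES = {
    "Dhrystone": (4.0, 5.78),
    "Whetstone FP32": (12.49, 9.24),
    "Whetstone FP64": (13.39, 12.49),
    "Integer vectorised": (7.12, 10.52),
    "Long vectorised": (1.78, 7.2),
    "Float vectorised": (2.01, 2.08),
    "Double vectorised": (5.97, 7.1),
}

# Mullins / Kabini speed-up per test (1.0 means no change).
ratios = {name: mullins / kabini for name, (kabini, mullins) in DOTNET_SCORES.items()}

worst = min(ratios, key=ratios.get)
best = max(ratios, key=ratios.get)
geomean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))

print(f"worst: {worst} ({ratios[worst]:.2f}x)")
print(f"best:  {best} ({ratios[best]:.2f}x)")
print(f"geometric mean speed-up: {geomean:.2f}x")
```

The single 4x outlier pulls the average up considerably, so the spread is more informative here than any one summary number.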

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Despite no core changes, Mullins is helped by higher rated clock (and Turbo when that works) – which gives it a good 5-19% performance improvement in most tests over its Kabini predecessor. Any new instruction set support (AVX2, FMA, SHA HWA, etc.) will have to wait for the next model.

Unfortunately, time does not stand still – Atom has seen more core improvements (though no new instruction support either) – which means it is far tougher competition than what Kabini had to deal with. While Mullins is competitive with it performance-wise, it is rated at roughly 4-6x the power (15W vs. 2.6-4W), so its power efficiency is low. While it can be “powered down” to hit a lower TDP, performance will naturally suffer – so then it would no longer be competitive with Atom.
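
The efficiency argument can be made concrete with a rough performance-per-watt figure. The sketch below divides the native FP32 Whetstone scores by the TDP values from the specification table. TDP is a rating, not measured power draw, so treat the result as an indication only; the `Chip` type and its field names are ours.

```python
from typing import NamedTuple


class Chip(NamedTuple):
    name: str
    tdp_watts: float               # rated TDP from the specification table
    whetstone_fp32_gflops: float   # native FP32 Whetstone score


CHIPS = [
    Chip("Atom X7 Z8700 (CherryTrail)", 2.4, 18.43),
    Chip("Core M 5Y10 (Broadwell-Y)", 4.5, 21.56),
    Chip("A4-5000 (Kabini)", 15.0, 13.55),
    Chip("A4-6000? (Mullins)", 15.0, 16.2),
]


def gflops_per_watt(chip: Chip) -> float:
    """Score divided by rated TDP; a crude efficiency proxy."""
    return chip.whetstone_fp32_gflops / chip.tdp_watts


for chip in sorted(CHIPS, key=gflops_per_watt, reverse=True):
    print(f"{chip.name}: {gflops_per_watt(chip):.2f} GFLOPS/W")
```

Even in a test where the raw scores are close, the rated-power gap dominates the ratio, which is the point the conclusion makes.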

Previously, AMD APUs relied on their much more powerful GPUs (see AMD A4 “Mullins” APU GPGPU (Radeon R4) performance) to make up for lower CPU performance and power efficiency – but now the latest Intel APUs (be they Atom or Core) are very much competitive – thus their main advantage has gone.

The only advantage would be cost – assuming that Mullins would be much cheaper than even Atom, though that is difficult to see. Thus there is not much where Mullins would be the top choice.

We’ll have to wait for the next AMD APU model – though, again, time does not stand still – and future Atom/Core M models will bring brand-new goodies (DDR4, new instruction sets, etc.) which may well make even tougher opposition. We shall have to wait and see…

AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kabini” (KB) SoC which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN)

In this article we test (GP)GPU graphics unit performance; CPU core performance is covered in a separate article.

Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

Graphics Unit CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comment
Graphics Core B-GT EV8 B-GT2Y EV8 GCN GCN There is no change in GPU core in Mullins, it appears to be a re-brand from 83XX to R4. But time does not stand still, so while Kabini went against BayTrail’s “crippled” EV7 (IvyBridge) GPU – Mullins must battle the “beefed-up” brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough…
APU / Processor Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) The series has changed but not much else, not even the CPU core.
Cores (CU) / Shaders (SP) / Type 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) 2C / 128SP 2C / 128SP [=] We still have 2 GCN Compute Units but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly?
Speed (Min / Max / Turbo) MHz 200 – 600 200 – 800 266 – 496 266 – 500 [=] Nope, clock has not changed either in Mullins.
Power (TDP) W 2.4 (under 4) 4.5 15 15 [=] As before, Intel’s designs have a crushing advantage over AMD’s: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than new Atom. Powered-down versions (6W?) would still consume more while performing worse.
DirectX / OpenGL / OpenCL Support 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 11.2 (12?) / 4.5 / 1.2 11.2 (12?) / 4.5 / 2.0 GCN supports DirectX 11.2 (not a big deal) and also OpenGL 4.5 (vs 4.3 on Intel but including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10’s DirectX 12. So while AMD supports newer versions of standards there’s not much in it.
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) No / Yes No / Yes Sadly even AMD does not support FP16 (why?) but does support FP64 (double-float) in all interfaces – while Atom/Core GPU only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs but it’s good to know it’s there…
Threads per CU 256 (256x256x256) 512 (512x512x512) 256 (256x256x256) 256 (256x256x256) GCN has traditionally not supported a large number of threads-per-CU (256) and here it is no different, with Intel’s GPU now supporting twice as many (512) – but whether this will make a difference remains to be seen.
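
For a rough sense of scale, the shader counts and maximum clocks in the table can be turned into a theoretical FP32 peak. The sketch below assumes every shader (SP) retires one fused multiply-add, counted as 2 FLOPs, per clock at the maximum listed clock. That is our simplifying assumption rather than a vendor figure, and real kernels land far below it.

```python
def peak_fp32_gflops(shaders: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak: shaders x clock x FLOPs per shader per clock."""
    return shaders * clock_mhz * flops_per_clock / 1000.0


# (shader count, maximum clock in MHz) from the specification table.
GPUS = {
    "CherryTrail GT": (128, 600),
    "Broadwell GT2Y (HD 5300)": (192, 800),
    "Kabini (Radeon HD 8330)": (128, 496),
    "Mullins (Radeon R4)": (128, 500),
}

for name, (shaders, clock_mhz) in GPUS.items():
    print(f"{name}: {peak_fp32_gflops(shaders, clock_mhz):.1f} GFLOPS (theoretical)")
```

On paper the unchanged GCN configuration no longer has a shader or clock advantage over either Intel part, which is consistent with the results that follow.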

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: GPGPU Vectorised
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 160.5 181.7 165.3 163.9 [-1%] Straight off the bat, we see no change in score in Mullins; however, Atom has caught up – scoring within a whisker and Core M faster still (+13%). Not what AMD is used to seeing for sure.
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 160 180 165 163.9 [-1%] As FP16 is not supported by any of the GPUs, unsurprisingly the results don’t change.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 10.1 (emulated) 11.6 (emulated) 13.4 14 [+4%] We see a tiny 4% improvement in Mullins but due to native FP64 support it is almost 40% faster than both Intel GPUs.
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.08 (emulated) 1.32 (emulated) 0.731 (emulated) 0.763 [+4%] (emulated) No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite FP64 support both Intel GPUs are much faster. Sometimes the FP64/FP32 ratio is so low that it’s not worth using FP64 and emulation can be faster (e.g. nVidia).
AMD Mullins: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 825 770 998 1024 [+3%] In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power being 25% faster than Intel’s GPUs – TDP notwithstanding.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1106 ? 1280 1423 [+11%] With fewer rounds, Mullins is now 11% faster – finally a good improvement and again 28% faster than Atom’s GPU.
AMD Mullins: GPGPU Hash
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 309 ? 59 282 [+4.7x] This 64-bit integer compute-heavy workload seems to have triggered a driver bug in Kabini since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than native? Surprisingly Atom’s EV8 is faster (+9%) – not something we’d expect to see.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 1187 1331 1618 1638 [+1%] In this integer compute-heavy workload, Mullins is just 1% faster (within margin of error) – which again proves GPU has not changed at all vs. older Kabini. At least it’s faster than both Intel GPUs, 38% faster than Atom’s.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2764 ? 2611 3256 [+24%] SHA1 is less compute-heavy but here we see a 24% Mullins improvement, again likely a driver “fix”. This allows it to beat Atom’s GPU by 17% – showing that driver optimisations can make a big difference.
AMD Mullins: GPGPU Financial
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 299.8 280.5 248.3 326.7 [+31%] Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini, which is just as well as both Intel GPUs are coming on strong – it’s 9% faster than Atom’s GPU. One thing’s for sure, Intel’s EV8 GPU is no slouch.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) n/a (no FP64) n/a (no FP64) 21 21.2 [+1%] AMD’s GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within margin of error), unable to replicate the FP32 improvement we saw.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 28 36.5 32.3 30.9 [-4%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%) – likely due to driver differences. Both Intel GPUs are coming strong, with Core M’s GPU 20% faster. Considering how fast the GCN shared memory is – we expected better.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 1.85 1.87 [+1%] Switching to FP64 on AMD’s GPUs, Mullins is now 1% faster (within margin of error). Luckily Intel’s GPUs do not support FP64.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 61.9 54.9 32.9 46.3 [+40%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again a driver change) – but surprisingly cannot match either Intel GPU, with Atom’s GPU 32% faster! Again, we see just how much Intel has improved the GPU in Atom – perhaps too much!
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 5.39 5.59 [+3%] Switching to FP64 we now see a little 3% improvement for Mullins, but better than the 1% we saw before…
AMD Mullins: GPGPU Scientific
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 45 44.1 43.5 41.5 [-5%] GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – again this allows Intel’s GPUs to win, with Atom’s GPU just 8% faster – but a win is a win. Mullins’s GPU is starting to look underpowered considering the much higher TDP.
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.11 3.73 [-9%] Switching to FP64, Mullins now manages to be 9% slower than Kabini – thankfully Intel’s GPUs don’t support FP64…
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 9 8.94 7.89 9.5 [+20%] FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again, just as well as Intel’s GPUs are hot on its tail – and it is just 5% faster than Atom’s GPU!
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 2.2 3 [+36%] Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 65 50 58 63 [+9%] In our last test we see Mullins is 9% faster – but not enough to beat Atom’s GPU, which is about 3% faster still. Did anybody expect that?
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.75 4.74 [=] Switching to FP64, Mullins scores exactly the same as Kabini.
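
The SHA2-512 comment above speculates that the older driver emulated 64-bit integer operations with 32-bit ones. As a hedged illustration of why that would be slower, the sketch below performs a 64-bit addition using only 32-bit halves and an explicit carry. It shows the general technique only; the function names are ours and it is not what any AMD driver is known to do.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def split64(value: int) -> Tuple[int, int]:
    """Split an unsigned 64-bit value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join64(lo: int, hi: int) -> int:
    """Recombine (low, high) 32-bit halves into a 64-bit value."""
    return (hi << 32) | lo


def add64_via_u32(a: int, b: int) -> int:
    """64-bit wrapping add built from two 32-bit adds plus a carry."""
    a_lo, a_hi = split64(a)
    b_lo, b_hi = split64(b)
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32                  # 1 if the low halves overflowed
    hi_sum = (a_hi + b_hi + carry) & MASK32
    return join64(lo_sum & MASK32, hi_sum)


print(hex(add64_via_u32(0x00000001FFFFFFFF, 1)))
```

One native instruction becomes several dependent ones, and rotates and shifts (which SHA2-512 uses heavily) fare worse still, so a multi-fold slowdown from emulation would not be surprising.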

Firstly, Mullins’s GPU scores are unchanged from Kabini; due to driver optimisations/fixes (as well as kernel optimisations) sometimes Mullins is faster but that’s not due to any hardware changes. If you were expecting more, you are to be disappointed.

Intel’s EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) now can keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD’s APUs used to have is long gone. Considering the TDP differences (4-5x higher), Mullins’s GPU looks underpowered – the number of cores should at least have been doubled to maintain its advantage.

Unless Atom (CherryTrail) is more expensive – there’s really no reason to choose Mullins; the power advantage of Atom is hard to deny. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel’s 15W ULV designs are far more powerful.

Transcoding Performance

We are testing transcoding performance of GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) AMD h264Encoder (hardware accelerated) AMD h264Encoder (hardware accelerated) Both are using their own hardware-accelerated transcoders for H264.
AMD Mullins: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 5 ? 2 2.14 [+7%] We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 4.75 ? 2 2.07 [+3.5%] When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast.

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast transcoding H.264! Poor Mullins/Kabini are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
Memory Configuration 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) Except Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 32 64 / 32 Surprisingly AMD’s GCN has 1/2 the shared memory of Intel’s EV8 (32 vs. 64) but considering the low number of threads-per-CU (256) only kernels making very heavy use of shared memory would be affected; still, more is better than less.
L1 / L2 / L3 Caches (kB) 256kB? L2 256kB? L2 16kB? L1 / 256kB? L2 16kB? L1 / 256kB? L2 Cache sizes are always pretty “hush hush” but since the core has not changed, we would expect the same cache sizes – with GCN also sporting a L1 data cache.
AMD Mullins: GPGPU Memory BW
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 11 10.1 8.8 5.5 [-38%] OpenCL memory performance has surprisingly taken a big hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 2.09 3.91 4.1 2.88 [-30%] Upload bandwidth is similarly affected, we measure a 30% decrease.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 2.29 3.79 3.18 2.9 [-9%] Download bandwidth is the least affected, just 9% lower.
AMD Mullins: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 829 274 1973 693 [-1/3x] Even though the core is unchanged, latency is 1/3 of Kabini. Since we don’t see a comparative increase in performance, this again points to a driver issue.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1669 ? ? 817 Going out-of-page does not increase latency much.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 279 ? ? 377 Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job.
GPGPU Memory Latency Constant Memory Latency (ns) 209 ? 629 401 [-33%] The L1 (16kB) cache does not cover the whole constant memory (64kB) – and latency is not lower than global memory. There is no advantage to using constant memory.
GPGPU Memory Latency Shared Memory Latency (ns) 201 ? 20 16 [-20%] Shared memory is a little bit faster (20% lower latency). We see that shared memory latency is much lower than constant/global latency (16 vs. 401) – denoting dedicated shared memory. On Intel’s EV8 GPU there is no change (201 vs. 209) – which would indicate global memory used as shared memory.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1234 ? 2369 691 [-70%] We see a massive latency reduction – again likely a driver optimisation/fix.
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 353 ? ? 353 Sequential access roughly halves latency on Mullins (353ns vs. 691ns in-page random) and cuts it to under a third on Atom (353ns vs. 1234ns) – showing the power of the prefetchers.

The optimisations in newer drivers make a big difference – though the same could apply to the previous gen (Kabini). The dedicated shared memory – compared to Intel’s GPUs – likely helps GCN achieve its performance.
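To make the three access patterns in the latency table concrete, here is a minimal pointer-chasing sketch. It assumes the usual approach to latency testing, where each load depends on the previous one, and that “in-page” means random order confined to one page at a time; the benchmark’s actual implementation is not described in this article, and the names `build_chain` and `chase` are hypothetical. The sketch only shows how the visiting orders differ, it does not reproduce GPU timings.

```python
import random

def build_chain(n_slots, page_slots, pattern, seed=0):
    """Build a 'next index' table that forms one cycle over all slots.

    pattern: 'sequential', 'in_page_random' or 'full_random'.
    page_slots is an assumed page size, expressed in slots.
    """
    rng = random.Random(seed)
    if pattern == "sequential":
        order = list(range(n_slots))
    elif pattern == "full_random":
        order = list(range(n_slots))
        rng.shuffle(order)  # jumps anywhere in the buffer
    elif pattern == "in_page_random":
        order = []
        for start in range(0, n_slots, page_slots):
            page = list(range(start, min(start + page_slots, n_slots)))
            rng.shuffle(page)  # random, but only within the current page
            order.extend(page)
    else:
        raise ValueError("unknown pattern: " + pattern)

    # Link each slot to its successor so every load depends on the last one.
    nxt = [0] * n_slots
    for pos, slot in enumerate(order):
        nxt[slot] = order[(pos + 1) % n_slots]
    return nxt, order[0]

def chase(nxt, start, steps):
    """Follow the chain for `steps` dependent loads; return visited slots."""
    visited = []
    idx = start
    for _ in range(steps):
        visited.append(idx)
        idx = nxt[idx]
    return visited

if __name__ == "__main__":
    for pattern in ("sequential", "in_page_random", "full_random"):
        nxt, start = build_chain(32, 8, pattern)
        print(pattern, chase(nxt, start, 12))
```

A real test would time the `chase` loop over a buffer much larger than the caches and divide by the number of steps; sequential chains are friendly to prefetchers, which is consistent with the lower sequential figures in the table.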

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Shader
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 127.6 ? 128.8 129.6 [=] Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 121.8 172 124 124.4 [=] OpenGL does not change matters, Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change.
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 109.5 ? 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 121.8 170 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change either.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 18 ? 8.9 8.91 [=] Unlike the OpenCL driver, Intel’s DirectX driver does support FP64 – which allows Atom’s GPU to be at least 2x (twice) as fast as Mullins/Kabini.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 26 46 12 12 [=] As above, Intel’s OpenGL driver also supports FP64 – so all GPUs run native FP64 code again. This allows Atom’s GPU to be over 2x faster than Mullins/Kabini again – while Core M’s GPU is almost 4x (four times!) faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.34 (emulated) ? 1.6 (emulated) 1.66 (emulated) [+3%] Here we’re emulating (mantissa extending) FP128 using FP64: EV8 stumbles a bit allowing Mullins/Kabini to be a little bit faster despite what we saw in FP64 test.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1.1 (emulated) 3.4 (emulated) 0.738 (emulated) 0.738 (emulated) [=] OpenGL does change the results a bit: Atom’s GPU is now faster (+50%) while Core M’s GPU is far faster (almost 5x). Heavy shaders seem to take their toll on GCN’s GPU.

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor the OpenGL driver seems to make a difference. What is different is that Intel’s GPUs support FP64 natively in both DirectX and OpenGL – making them roughly 2-4x faster than AMD’s GCN in the FP64 tests above. If Intel’s OpenCL driver were to support it too – AMD would be in trouble!
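For readers unfamiliar with the “mantissa extending” emulation used in the FP128 rows, the sketch below shows one well-known way to extend precision with pairs of FP64 values (so-called double-double arithmetic, built on the Knuth two-sum and Dekker two-product steps). This is an assumption about the general technique, not the benchmark’s actual shader code, and a double-double pair gives roughly 106 mantissa bits rather than true IEEE FP128. The function names are illustrative.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a):
    """Dekker split of an FP64 value into two halves with short mantissas."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and p + err == a * b exactly
    (barring overflow/underflow)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dd_add(x, y):
    """Add two extended values, each stored as a (high, low) FP64 pair."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)

def dd_mul(x, y):
    """Multiply two extended values stored as (high, low) FP64 pairs."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return two_sum(p, e)

if __name__ == "__main__":
    # Plain FP64 drops the tiny term; the pair keeps it in the low word.
    print(1.0 + 1e-30 == 1.0)
    print(dd_add((1.0, 0.0), (1e-30, 0.0)))
```

Each extended operation costs many FP64 operations, which is one plausible reason the emulated FP128 scores in the table are an order of magnitude below the native FP64 ones.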

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 11.18 12.46 8 9.7 [+21%] DirectX bandwidth does not seem to be affected by the OpenCL “bug”: here we see Mullins having 21% more bandwidth than Kabini using the very same memory. Perhaps the memory controller has seen some improvements after all.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.83 5.29 3 3.61 [+20%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again Mullins does well with 20% more bandwidth.
Video Memory Benchmark Download Bandwidth (GB/s) 2.1 1.23 3 3.34 [+11%] Download bandwidth improves by 11% only, but better than nothing.

Unlike OpenCL, we see DirectX bandwidth increased by 11-21% – while using the same memory. Hopefully AMD will “fix” the OpenCL issue, which should help kernel performance no end.
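The bracketed deltas in the bandwidth tables can be checked with a few lines of arithmetic. The sketch below recomputes the Kabini to Mullins change for both APIs from the figures quoted in the tables; the helper name `pct_change` and the dictionary layout are just illustrative choices.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# (Kabini, Mullins) bandwidth in GB/s, copied from the tables above.
opencl = {"internal": (8.8, 5.5), "upload": (4.1, 2.88), "download": (3.18, 2.9)}
directx = {"internal": (8.0, 9.7), "upload": (3.0, 3.61), "download": (3.0, 3.34)}

if __name__ == "__main__":
    for api, rows in (("OpenCL", opencl), ("DirectX", directx)):
        for name, (kabini, mullins) in rows.items():
            print(f"{api:8s} {name:9s} {pct_change(kabini, mullins):+.1f}%")
```

Seen side by side, the OpenCL figures all fall while the DirectX figures all rise on the same memory, which is what makes a driver-level cause the more likely explanation.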

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mullins’s GPU is unchanged from its predecessor (Kabini) but a few driver optimisations/fixes allow it to score better in many tests by a small margin – however these would also apply to the older devices. There isn’t really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins’s TDP is 15W vs. Atom at 2.4-4W or Core M at 4.5W – it’s really not looking good for AMD: its CPU performance is unlikely to be much better than Atom’s (we shall see in CPU AMD A4 “Mullins” performance) – and at 3-5x (three to five times) more power it is woefully power inefficient.
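As a rough illustration of the efficiency gap, the sketch below divides the FP32 OpenGL shader scores from the table above by the rated TDP from the specifications table. Rated TDP is not measured power draw, and the Atom figure used is the lower 2.4W rating (using 4W would reduce its score per watt), so treat the output as a ballpark only.

```python
# FP32 vectorised OpenGL shader scores (Mpix/s), from the shader table.
scores = {"Atom X7 Z8700": 121.8, "Core M 5Y10": 172.0,
          "A4-5000 (Kabini)": 124.0, "A4 (Mullins)": 124.4}

# Rated TDP in watts, from the hardware specifications table.
# Assumption: the lower 2.4W rating is used for Atom.
tdp_watts = {"Atom X7 Z8700": 2.4, "Core M 5Y10": 4.5,
             "A4-5000 (Kabini)": 15.0, "A4 (Mullins)": 15.0}

# Score per rated watt; higher is better.
per_watt = {name: scores[name] / tdp_watts[name] for name in scores}

if __name__ == "__main__":
    for name, value in sorted(per_watt.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {value:6.2f} Mpix/s per rated W")
```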

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.