AMD A4 “Mullins” APU CPU: Time does not stand still…

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kabini” (KB) SoC, which was AMD’s major foray into tablets/netbooks and itself replaced the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” to a lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update to both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • Crypto: Random number generator.
  • Security: Platform Security Processor (PSP) included in the SoC (ARM based).

In this article we test CPU core performance.

Hardware Specifications

We are comparing Mullins with its predecessor (Kabini) as well as its competition from Intel.

APU Specifications | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments
Cores (CU) / Threads (SP) | 4C / 4T | 2C / 4T | 4C / 4T | 4C / 4T | We still have 4 cores and 4 threads just like Atom (old and new) – only Core M has 2 cores with HT – we shall see whether this makes a big difference.
Speed in MHz (Min / Max / Turbo) | 480-1600-2400 (6x-20x-30x) | 500-800-2000 (5x-8x-20x) | 1000-1500 (10x-15x) | 1000-1800-2400 (10x-18x-24x) | Mullins is clocked a bit higher (1.8GHz vs. 1.5GHz – 20% faster) but also supports Turbo (up to 2.4GHz – up to 60% faster) which should give it a big advantage over old Kabini. Both Atom and Core M also depend on opportunistic Turbo for most of their performance. As Mullins/Kabini are 15W rated, they should be able to Turbo higher and for longer – at least in theory.
Power (TDP) | 2.4W | 4.5W | 15W | 15W [=] | TDP remains the same at 15W, which is a bit disappointing considering the new Atom is 2.4-4W rated: we’re talking 3-5x (up to five times) more power!
L1D / L1I Caches | 4x 24kB 6-way / 4x 32kB 8-way | 2x 32kB 8-way / 2x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 2-way | 4x 32kB 8-way / 4x 32kB 2-way | No change in L1 caches, which pretty much match Atom’s; comparatively Core M has half the caches.
L2 Caches | 2x 1MB 16-way | 4MB 16-way | 2MB 16-way | 2MB 16-way | No change in L2 cache; here Core M has twice as much cache – same size as a normal i7 ULV. It’s a pity AMD was not able to increase the caches.
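
The clock comparison in the table can be checked with simple arithmetic. Below is a minimal Python sketch using only the rated and Turbo clocks listed above; the helper name uplift_percent is ours, purely for illustration.

```python
def uplift_percent(new_mhz: float, old_mhz: float) -> float:
    """Relative clock increase of new over old, in percent."""
    return (new_mhz / old_mhz - 1.0) * 100.0


# Clocks taken from the specification table (MHz).
KABINI_RATED_MHZ = 1500   # A4-5000, no Turbo listed
MULLINS_RATED_MHZ = 1800  # Mullins rated clock
MULLINS_TURBO_MHZ = 2400  # Mullins maximum Turbo clock

rated_gain = uplift_percent(MULLINS_RATED_MHZ, KABINI_RATED_MHZ)
turbo_gain = uplift_percent(MULLINS_TURBO_MHZ, KABINI_RATED_MHZ)

print(f"Rated clock uplift: {rated_gain:.0f}%")
print(f"Turbo clock uplift: {turbo_gain:.0f}%")
```

If Turbo does not engage, the rated uplift is roughly the most we would expect from clock alone on compute-bound tests.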

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Haswell introduces AVX2 which allows 256-bit integer SIMD (AVX only allowed 128-bit) and FMA3 – finally bringing “fused-multiply-add” for Intel CPUs. We are now seeing CPUs getting as wide as (GP)GPUs – not far from AVX3 512-bit in “Phi”.
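
To make the register-width point concrete, here is a small sketch in plain Python (no SIMD is actually executed) that counts how many elements fit in one register of a given width. The widths are the ones named above; the helper name lanes is hypothetical.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of element_bits that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register width")
    return register_bits // element_bits


# 128-bit, 256-bit and 512-bit registers, as discussed in the text.
for width in (128, 256, 512):
    print(f"{width}-bit: {lanes(width, 32)} x 32-bit, {lanes(width, 64)} x 64-bit")
```

Doubling the register width doubles the elements processed per instruction, which is why the integer tests are so sensitive to AVX2 support.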

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) Comments
Arithmetic Native
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 35.91 SSE4 [+20%] 31.84 AVX2 25.12 SSE4 28.77 SSE4 [+18%] Mullins like Kabini has no AVX2, but is still 18% faster than it (clocked 20% faster), unfortunately the new Atom manages to beat it. Not the best of starts.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 18.43 AVX [+13%] 21.56 AVX/FMA 13.55 AVX 16.2 AVX [+19%] Mullins has no FMA either, so is again 19% faster than Kabini – it shows the ALU and FPUs are unchanged. Again Atom manages to be faster.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 12.34 AVX [+23%] 13.49 AVX/FMA 8.44 AVX 10 AVX [+18%] With FP64 we see the same 18% difference – and Atom is still 20% faster.
We only see a 18-19% improvement in Mullins – in line with clock speed (+20%) with Turbo not doing much. The new CherryTrail Atom is thus 14-21% faster than it, not something you expect considering the hugely different TDP. Time does not stand still and Mullins is outclassed here.
SIMD Native
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 48.7 AVX [-20%] 70.8 AVX2 58.37 60.76 [+4%] Without AVX2, Mullins can only manage a paltry 4% improvement over Kabini – the only “silver lining” is that Atom is 20% slower than it – unlike what we saw before. Naturally Core M with AVX2 runs away with it.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 14.5 AVX [-38%] 24.5 AVX2 21.92 AVX 23.26 AVX [+6%] With a 64-bit integer workload, the improvement increases to 6%, less than the clock difference – but thankfully Atom is much slower (by 38%) – and naturally Core M is the winner.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 0.512 [+75%] 0.382 0.246 0.292 [+18%] This is a tough test using Long integers to emulate Int128, but here we see the full 18% improvement over Kabini – but now Atom is a huge 75% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 41.5 AVX [-4%] 61.3 FMA 36.91 AVX 43 AVX [+16%] In this floating-point AVX/FMA algorithm, Mullins returns to being 16% faster and a whisker faster than Atom (4%). With FMA, Core M is almost 50% faster still.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 15.9 AVX [-31%] 36.48 FMA 19.57 AVX 22.9 AVX [+17%] Switching to FP64 code, we see a 17% improvement for Mullins which allows it to be 30% faster than Atom.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 0.81 AVX [-37%] 1.69 FMA 1.27 AVX 1.27 AVX [=] In this heavy algorithm using FP64 to mantissa extend FP128, we see no improvement whatsoever – at least Atom is 37% slower; and yes, Core M is faster still.
Lack of AVX2/FMA and a Turbo that does not seem to engage make Mullins struggle to be more than 16-18% faster than Kabini – but thankfully it can beat its Atom rival, sometimes by a good 30%. All in all, it does better with SIMD code than we’ve seen elsewhere, though without any core changes it is showing its age…
Cryptography Native
BenchCrypt Crypto AES-256 (GB/s) 1.44 AES HWA [-55%] 2.59 AES HWA 3 AES HWA 3.14 AES HWA [+5%] All four CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – here Mullins is 5% faster than Kabini and 2x (twice) as fast as Atom; it even overtakes Core M with its dual-channel controller!
BenchCrypt Crypto AES-128 (GB/s) 2 AES HWA [-40%] ? AES HWA 3 AES HWA 3.3 AES HWA [+10%] What we saw with AES-256 was no fluke: fewer rounds do make some difference, and Mullins is now 10% faster than Kabini and 65% faster than Atom!
BenchCrypt Crypto SHA2-256 (GB/s) 0.572 AVX [-24%] 0.93 AVX2 0.708 AVX 0.659 AVX [-7%] In this tough AVX compute test, Mullins is unexpectedly 7% slower than the old Kabini – but Atom is still slower. But with SHA HWA in the next Atom, AMD will quickly be at a big disadvantage…
BenchCrypt Crypto SHA1 (GB/s) 1.17 AVX [-23%] ? AVX2 1.42 AVX 1.34 AVX [-6%] With a less complex algorithm – we still see Mullins 6% slower than Kabini – and again Atom is slower.
BenchCrypt Crypto SHA2-512 (GB/s) ? AVX ? AVX2 0.511 AVX 0.477 AVX [-7%] By using 64-bit integers this is pretty much the most complex hashing algorithm and thus tough for all CPUs – and here we see Mullins 7% slower again.
Mullins misses both AVX2 and the forthcoming SHA HWA – but manages to extract more memory bandwidth and is thus 5-10% faster than Kabini in AES, and also much faster than Atom. Somehow it is 6-7% slower than Kabini in hashing, whatever the algorithm – but it remains faster than Atom.
Financial Analysis Native
BenchFinance Black-Scholes float/FP32 (MOPT/s) 14.15 [-26%] 21.88 17.55 18.87 [+7%] In this non-SIMD test we start with a 7% improvement over Kabini, good but less than we expected – but at least faster than Atom.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 11.66 [-27%] 17.6 13.43 15.82 [+18%] Switching to FP64 code, we see a good 18% improvement – and victory over Atom again.
BenchFinance Binomial float/FP32 (kOPT/s) 3.85 [-22%] 5.1 1.31 4.33 [+3.3x] Binomial uses thread shared data thus stresses the cache & memory system; Mullins has improved by a huge 3.3x (over three times) – and a great win over Atom (but not Core M). It seems the new memory improvements do help a lot.
BenchFinance Binomial double/FP64 (kOPT/s) 4.13 [-31%] 5.02 2.66 5.91 [+2.2x] With FP64 code, Mullins is “only” 2.2x (over two times) faster than Kabini – and this again gives it a big win over its Atom competition – as well as over Core M!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 2.67 [-29%] 3.73 3.23 3.71 [+15%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; here Mullins is just 15% faster but it’s enough to tie with Core M and leave Atom in the dust.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 2.43 [-30%] 3.0 2.67 3.43 [+28%] Switching to FP64 we see a big 28% improvement – Mullins manages to beat both Atom and Core M by a good measure.
Somehow Mullins managed to redeem itself – beating Atom in all tests and even Core M in some of the tests. Running financial tests on a Mullins tablet should work better than on an Atom or Core M one.
Scientific Analysis Native
BenchScience SGEMM (GFLOPS) float/FP32 12.42 AVX [-7%] 13.99 FMA 11.43 AVX 13.22 AVX [+16%] In this tough SIMD algorithm, Mullins sees a good 16% improvement beating Atom and getting within a whisker to Core M – even without FMA.
BenchScience DGEMM (GFLOPS) double/FP64 6.09 AVX [-13%] 9.61 FMA 6.62 AVX 6.93 AVX [+5%] With FP64 SIMD code, Mullins is just 5% faster – but it’s enough to beat Atom – though not Core M. Still, a good improvement.
BenchScience SFFT (GFLOPS) float/FP32 5.17 AVX [+72%] 4.11 FMA 2.91 AVX 3 AVX [+3%] FFT also uses SIMD and thus AVX but stresses the memory sub-system more: here Mullins sees only a 3% improvement, not enough to beat Atom which is over 70% faster still. Mullins has its limits.
BenchScience DFFT (GFLOPS) double/FP64 2.83 AVX [+52%] 3.66 FMA 1.54 AVX 1.85 AVX [+20%] With FP64 code, Mullins improves by a large 20% – but again not enough to beat Atom which is now over 50% faster still.
BenchScience SNBODY (GFLOPS) float/FP32 1.58 AVX [-31%] 2.64 FMA 2.1 AVX 2.27 AVX [+8%] N-Body simulation is SIMD heavy but makes many memory accesses to shared data, so Mullins is just 8% faster than Kabini, but that is enough to beat Atom (by 43%). Unlike FFT, N-Body again agrees with Mullins/Kabini.
BenchScience DNBODY (GFLOPS) double/FP64 1.76 AVX [-30%] 3.71 FMA 2.09 AVX 2.51 AVX [+20%] With FP64 code Mullins improves again by 20% – but now more than enough to beat Atom (by 42%) – though not enough to beat Core M.
With highly optimised SIMD AVX code, Mullins sees a 5-20% improvement – which allows it to beat Atom in most tests – a good result from the rout we saw before.
Inter-Core Native
BenchMultiCore Inter-Core Bandwidth (GB/s) 1.69 [-52%] 8.5 3.0 3.47 [+15%] With unchanged L1/L2 caches Mullins relies on its higher rated speed – and 15% is a good improvement over Kabini. It’s got 2x (two times) the bandwidth Atom manages to muster – but way below Core M’s which has over 2.4x more still. We see how all these caches perform in the Cache & Memory AMD A4 “Mullins” performance article.
BenchMultiCore Inter-Core Latency (ns) 179 [1/3.5x] 76 66 31 [-33%] Latency, however, sees a massive 33% decrease, more than we’d expect – and surprisingly is way lower than Atom (1/3.5x) and even lower than Core M (1/2x).

While it does not bring any new instruction sets (AVX2, FMA, SHA HWA) and its Turbo does not seem to engage, Mullins’s 20% clock improvement does show and brings a corresponding 5-19% increase in performance over Kabini in most tests.

Against Atom, the scores are all over the place: sometimes Atom (CherryTrail) is 20-70% faster, other times Mullins is 20-55% faster. If they were rated the same TDP-wise that would be a good result – but as Mullins is rated 15W vs. 2.4-4W that’s not really power efficient. Core M is faster than either in most tests.
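
The power-efficiency argument can be made a little more tangible by dividing a score by the rated TDP. The sketch below is only a rough illustration: it uses the native FP32 Whetstone scores and the TDP figures from the tables above, and TDP is a nominal rating rather than measured power draw. The dictionary layout and function name are our own.

```python
# Native FP32 Whetstone scores (GFLOPS) and rated TDP (W) from the tables above.
# Assumption: TDP stands in for power, which is only a nominal rating.
results = {
    "Atom X7 Z8700": (18.43, 2.4),
    "Core M 5Y10": (21.56, 4.5),
    "Kabini": (13.55, 15.0),
    "Mullins": (16.2, 15.0),
}


def gflops_per_watt(score_gflops: float, tdp_watts: float) -> float:
    """Score divided by rated TDP."""
    return score_gflops / tdp_watts


efficiency = {cpu: gflops_per_watt(s, w) for cpu, (s, w) in results.items()}

# Print the CPUs from most to least efficient on this single test.
for cpu, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cpu:15s} {value:5.2f} GFLOPS/W")
```

A single test is not representative of a whole workload, but it shows the scale of the gap that the TDP ratings imply.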

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 8.x/10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64 SP1, latest AMD and Intel drivers. .Net 4.5.x, Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) Comments
.Net Arithmetic
BenchDotNetAA .Net Dhrystone (GIPS) 3.85 [-33%] 2.95 4.0 5.78 [+44%] .Net CLR performance improves by a huge 44% – a great start, well above the 20% rated clock increase, and enough to leave Atom 33% behind.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 9 [-3%] 10.8 12.49 9.24 [-23%] Floating-point CLR performance takes a 23% hit compared to Kabini – but thankfully Mullins is still a bit faster than Atom (by just 3%). Something in the new CLR does not agree with it.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 10 [-20%] 12.71 13.39 12.49 [-7%] FP64 CLR performance sees a more modest 7% decrease in performance, but is again still 20% faster than Atom.
With .Net we see a big variation from 23% lower to 44% higher performance than Kabini, but in all cases higher than Atom (between 3-33%). It is strange to see such a big variance, but the CLR changes may have something to do with it.
.Net Vectorised
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 11.34 [+7.8%] 11.25 7.12 10.52 [+47%] Just as we saw with Dhrystone, this integer workload sees a 47% improvement with Mullins – something in the CLR does agree with it – though not as much as Atom which is 8% faster still.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 6.36 [-12%] 5.62 1.78 7.2 [+4.04x] With 64-bit integer vectorised workload, we see a massive 4x (four times) improvement over Kabini – and 12% faster than Atom.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 2.6 [+25%] 4.14 2.01 2.08 [+3%] Switching to single-precision (FP32) floating-point code, we see only a minor 3% improvement – and here Atom is 25% faster still.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 6.25 [-12%] 9.36 5.97 7.1 [+19%] Switching to FP64 code, Mullins is 19% faster (in line with clock increase) and 12% faster than Atom. While unlikely compute tasks are written in .Net rather than native code, small compute code does benefit.
Vectorised .Net improves between 3-47% over Kabini (except the 64-bit integer “fluke”), and is thus sometimes faster but sometimes slower than Atom.

We see a big variation here – unlike what we saw with native / SIMD code – likely due to CLR changes – but generally welcome. Against Atom we see an even larger variation – faster and slower, but overall competitive.

Final Thoughts / Conclusions

Despite no core changes, Mullins is helped by higher rated clock (and Turbo when that works) – which gives it a good 5-19% performance improvement in most tests over its Kabini predecessor. Any new instruction set support (AVX2, FMA, SHA HWA, etc.) will have to wait for the next model.

Unfortunately, time does not stand still – Atom has seen more core improvements (though no new instruction support either) – which means it is far tougher competition than what Kabini had to deal with. While Mullins is competitive with it performance wise, it is rated at 3-5x the power (15W vs. 2.4-4W), so its power efficiency is low. While it can be “powered down” to hit a lower TDP, performance will naturally suffer – so then it would no longer be competitive with Atom.

Previously, AMD APUs relied on their much more powerful GPUs (see AMD A4 “Mullins” APU GPGPU (Radeon R4) performance) to make up for lower CPU performance and power efficiency – but now the latest Intel APUs (be they Atom or Core) are very much competitive – thus their main advantage has gone.

The only advantage would be cost – assuming that Mullins would be much cheaper than even Atom, though that is difficult to see. Thus there is not much where Mullins would be the top choice.

We’ll have to wait for the next AMD APU model – though, again, time does not stand still – and future Atom/Core M models will bring brand-new goodies (DDR4, new instruction sets, etc.) which may well make even tougher opposition. We shall have to wait and see…

Intel Atom X7 (CherryTrail 2015) CPU: Closing on Core M?

What is CherryTrail (Braswell)?

“CherryTrail” (CYT) is the next generation Atom “APU” SoC from Intel (v3 2015) replacing the current Z3000 “BayTrail” (BYT) SoC which was Intel’s major foray into tablets (both Windows & Android). The “desktop” APUs are known as “Braswell” (BRS) while the APUs for other platforms have different code names.

BayTrail was a major update to both CPU (out-of-order core, SSE4.x, AES HWA, Turbo/dynamic overclocking) and GPU (EV7 IvyBridge GPGPU core), so CherryTrail is a minor process shrink – but with a very much updated GPGPU – updated to EV8 (as in the latest Core Broadwell).

In this article we test CPU core performance.

Hardware Specifications

We are comparing Atom processors with the Core M processors. New models may run at higher (or lower) frequencies than the ones they replace, thus performance delta can vary.

APU Specifications | Atom Z3770 (BayTrail) | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | Comments
Cores (CU) / Threads (SP) | 4C / 4T | 4C / 4T | 2C / 4T | The new Atom still has 4C and no HT, same as the old Atom; Core M has 2 cores but with HT so still 4 threads in total – we shall see whether this makes a big difference.
Speed in MHz (Min / Max / Turbo) | 533-1467-2400 (4x-11x-18x) | 480-1600-2400 (6x-20x-30x) | 500-800-2000 (5x-8x-20x) | The new CherryTrail Atom has a lower BCLK (80MHz) compared to BayTrail (133MHz) and thus higher multipliers, while LFM (Low), MFM (Rated) and Turbo (TFM) speeds are very similar.
Power (TDP) | 2.4W | 2.4W | 4.5W | TDP remains the same for the new Atom while Core M needs around 2x (twice) as much power.
L1D / L1I Caches | 4x 24kB 6-way / 4x 32kB 8-way | 4x 24kB 6-way / 4x 32kB 8-way | 2x 32kB 8-way / 2x 32kB 8-way | No change in L1 caches on the new Atom; comparatively Core M has half the caches.
L2 Caches | 2x 1MB 16-way | 2x 1MB 16-way | 4MB 16-way | No change in L2 cache; here Core M has twice as much cache – same size as a normal i7 ULV.
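
The relation between BCLK, multiplier and core clock noted in the table is simply clock = BCLK x multiplier. The sketch below reproduces the listed speeds under two assumptions of ours: that the nominal “133MHz” BCLK is exactly 400/3 MHz, and that BayTrail’s 533MHz low-frequency mode corresponds to a 4x multiplier (533 / 133 is about 4).

```python
def core_clock_mhz(bclk_mhz: float, multiplier: int) -> float:
    """Core clock as base clock times multiplier."""
    return bclk_mhz * multiplier


BAYTRAIL_BCLK_MHZ = 400 / 3     # assumption: nominal "133MHz" is 133.33 MHz
CHERRYTRAIL_BCLK_MHZ = 80.0     # from the table comment

# Multipliers for low / rated / Turbo modes.
baytrail = [round(core_clock_mhz(BAYTRAIL_BCLK_MHZ, m)) for m in (4, 11, 18)]
cherrytrail = [round(core_clock_mhz(CHERRYTRAIL_BCLK_MHZ, m)) for m in (6, 20, 30)]

print("BayTrail    (MHz):", baytrail)
print("CherryTrail (MHz):", cherrytrail)
```

The lower BCLK is why CherryTrail needs much higher multipliers to reach essentially the same clocks.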

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Haswell introduces AVX2 which allows 256-bit integer SIMD (AVX only allowed 128-bit) and FMA3 – finally bringing “fused-multiply-add” for Intel CPUs. We are now seeing CPUs getting as wide as (GP)GPUs – not far from AVX3 512-bit in “Phi”.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Atom Z3770 (BayTrail) Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) Comments
Intel Braswell: CPU Arithmetic
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 15.11 SSE4 35.91 SSE4 [+55%] 31.84 AVX2 [-12%] CherryTrail has no AVX2 but manages to be 55% faster than old Atom and even a bit faster than Core M with AVX2! A great start!
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 11.09 AVX 18.43 AVX [+8.4%] 21.56 AVX/FMA [+17%] CherryTrail has no FMA either, but here it is only 8% faster than old Atom. Core M does better here – it’s 16% faster still.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 8.6 AVX 12.34 AVX [+2.8%] 13.49 AVX/FMA [+17%] With FP64 we see an even smaller 3% gain. Core M remains 16% faster still.
We see a big improvement with integer workload of 50% but minor 3-8% with floating-point. Sure Core M is faster (16%) but not by a huge amount to justify power and cost increase.
Intel Braswell: CPU Vectorised SIMD
CPU Multi-Media Vectorised SIMD Native Integer (Int32) Multi-Media (Mpix/s) 25.1 AVX 48.7 AVX [+25%] 70.8 AVX2 [+45%] Again CherryTrail has to do without AVX2, but is still 25% faster than old Atom – a decent improvement. Here though AVX2 of Core M has a sizeable 45% improvement.
CPU Multi-Media Vectorised SIMD Native Long (Int64) Multi-Media (Mpix/s) 9 AVX 14.5 AVX [+4.6%] 24.5 AVX2 [+69%] With a 64-bit integer workload, the improvement drops to about 5%. Here Core M with AVX2 is 70% faster still – finally a big improvement over Atom.
CPU Multi-Media Vectorised SIMD Native Quad-Int (Int128) Multi-Media (Mpix/s) 0.109 0.512 [+3.3x] 0.382 [-25%] This is a tough test using Long integers to emulate Int128, but here CherryTrail manages to be over 3x faster than old Atom – and even faster than Core M! Without SIMD the new Atom does much better than even Core M.
CPU Multi-Media Vectorised SIMD Native Float/FP32 Multi-Media (Mpix/s) 18.9 AVX 41.5 AVX [+42%] 61.3 FMA [+47%] In this floating-point AVX/FMA algorithm, CherryTrail does much better returning with a 42% improvement over old Atom. Core M also returns to a 47% improvement still – seems to be the par for SIMD ops.
CPU Multi-Media Vectorised SIMD Native Double/FP64 Multi-Media (Mpix/s) 8.18 AVX 15.9 AVX [+93%] 36.48 FMA [+2.3x] Switching to FP64 code, CherryTrail is now almost 2x as fast as old Atom – with Core M being over 2x faster still.
CPU Multi-Media Vectorised SIMD Native Quad-Float/FP128 Multi-Media (Mpix/s) 0.5 AVX 0.81 AVX [+62%] 1.69 FMA [+2.1x] In this heavy algorithm using FP64 to mantissa extend FP128, CherryTrail manages a 62% improvement and again Core M is over 2x still.
Lack of AVX2/FMA and higher Turbo speed prevents the new Atom from crushing the old Atom – but still allows a 25-100% (2x as fast) improvement, a significant improvement. Here, though, with AVX2 and FMA – Core M is 45-100% faster still thus could be worth it if more compute power is required.
Intel Braswell: CPU Crypto
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 1.09 AES HWA 1.44 AES HWA [+32%] 2.59 AES HWA [+79%] All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – here CherryTrail is 32% faster – thus we’d predict its memory controller can yield +35% more bandwidth. But even with just half the number of cores, Core M is 80% faster still – likely due to its dual-channel controller.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.51 AES HWA 2 AES HWA [+32%] ? AES HWA What we saw with AES-256 was no fluke: fewer rounds don’t make any difference; the new Atom is 32% faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 0.143 AVX 0.572 AVX [+4x] 0.93 AVX2 [+63%] SHA HWA will come for the next Atom arch, so neither CPU supports it. But CherryTrail can manage to be a whopping 4x (four times) faster than old Atom. But Core M with AVX2 still manages to be 63% faster.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 0.388 AVX 1.17 AVX [+3x] ? AVX2 With a less complex algorithm – we see a 3x (three times) improvement over old Atom.
CherryTrail still misses AVX2 or forthcoming SHA HWA – but due to (expected) higher memory bandwidth it still improves over old Atom – while raw compute is 3-4x faster – a huge improvement. Due to its dual-channel controller and AVX2 Core M is still faster than it by a good 60-80%.
Intel Braswell: CPU Finance
CPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 8.0 14.15 [+76%] 21.88 [+54%] In this non-SIMD test we start with a good 76% improvement over old Atom – while Core M is 50% faster still. CherryTrail shows its prowess in FPU processing.
CPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7.3 11.66 [+59%] 17.6 [+50%] Switching to FP64 code, CherryTrail is still 60% faster – and Core M remains 50% faster still.
CPU Finance Benchmark Binomial float/FP32 (kOPT/s) 1.76 3.85 [+2.2x] 5.1 [+32%] Binomial uses thread shared data thus stresses the cache & memory system; CherryTrail is over 2x (twice) as fast as old BayTrail – with Core M just 32% faster. It seems the new memory improvements do help a lot.
CPU Finance Benchmark Binomial double/FP64 (kOPT/s) 1.33 4.13 [+3.1x] 5.02 [+21%] With FP64 code CherryTrail is now a whopping 3x (three times) faster – a massive improvement. Core M is just 20% faster still.
CPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1.37 2.67 [+94%] 3.73 [+39%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; CherryTrail remains about 2x faster than old Atom – showing just how much the memory improvements help. Core M is still 40% faster.
CPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1.54 2.43 [+57%] 3.0 [+23%] Switching to FP64 the improvement drops to around 60% – still a big improvement. Core M’s improvement drops to 23%.
The financial tests show big improvements of 60-200% – complex compute tasks are now very much possible on Atom. Without help from AVX2 and FMA – Core M can only be 20-50% faster still, not the improvement we were hoping.
Intel Braswell: CPU Science
CPU Science Benchmark SGEMM (GFLOPS) float/FP32 6.19 AVX 12.42 AVX [+2x] 13.99 FMA [+12%] In this tough SIMD algorithm, again CherryTrail manages to be 2x as fast as old Atom; even with the help of FMA, Core M is only 12% faster. The new Atom is just running away with it.
CPU Science Benchmark DGEMM (GFLOPS) double/FP64 3.38 AVX 6.09 AVX [+80%] 9.61 FMA [+58%] With FP64 SIMD code, CherryTrail remains about 80% faster, just under 2x. Here Core M shows its power, it’s almost 60% faster still.
CPU Science Benchmark SFFT (GFLOPS) float/FP32 2.83 AVX 5.17 AVX [+82%] 4.11 FMA [-21%] FFT also uses SIMD and thus AVX but stresses the memory sub-system more: CherryTrail remains 82% faster than old Atom and here it beats even Core M.
CPU Science Benchmark DFFT (GFLOPS) double/FP64 2 AVX 2.83 AVX [+41%] 3.66 FMA [+29%] With FP64 code, the improvement is reduced to just 41% but it is still significant. Core M’s improvement drops to just 30%.
CPU Science Benchmark SNBODY (GFLOPS) float/FP32 0.929 AVX 1.58 AVX [+70%] 2.64 FMA [+67%] N-Body simulation is SIMD heavy but makes many memory accesses to shared data; CherryTrail still manages a 70% improvement, with Core M 70% faster still.
CPU Science Benchmark DNBODY (GFLOPS) double/FP64 1.25 AVX 1.76 AVX [+40%] 3.71 FMA [+2.1x] Unlike what we saw before, with FP64 code the improvement drops to 40% as with DFFT. Core M is over 2x faster still.
With highly optimised SIMD AVX code, we see a similar 40-100% improvement in performance; as we said before complex algorithms run a lot faster on the new Atom. With FP64 code, Core M’s FPU shows its power but at what cost?
Intel Braswell: CPU Multi-Core Transfer
CPU Multi-Core Benchmark Inter-Core Bandwidth (GB/s) 1.63 1.69 [+5%] 8.5 [+5x] With unchanged L1/L2 caches and similar rated/Turbo clocks – it is not surprising the new Atom does not improve on inter-core bandwidth. Core M shows its prowess – managing over 5x higher bandwidth. We see how all these caches perform in the Cache & Memory Atom (CherryTrail) performance article.
CPU Multi-Core Benchmark Inter-Core Latency (ns) 182 179 [-6%] 76 [-68%] Latency, however, sees a 5% decrease, likely due to the higher clock of CherryTrail during run (due to higher rated speed) since the caches are unchanged. Core M’s latencies seem to be 1/2 (half) of Atom.

While it does not bring any new instruction sets, CherryTrail improves significantly over the old Atom (30-100%) – much more than we’ve seen in the Core series from arch to arch. It does not support new instruction sets (e.g. AVX2, FMA, SHA HWA) nor new caches – which is perhaps just as well, as Core M is not much faster.

Considering just how much BayTrail improved over the old Atom arch, the improvements are particularly impressive.

About the only “issue” is FP64 performance where Core M shows its power – whether using SIMD AVX/FMA or FPU code.
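
One way to check the FP64 observation is to average the Core M vs. CherryTrail ratios separately for FP32 and FP64. The sketch below does this with a geometric mean over the six science rows above; the choice of geometric mean and all names in the code are ours, not part of the benchmark suite.

```python
import math

# (CherryTrail, Core M) scores in GFLOPS, copied from the science rows above.
science = {
    "SGEMM": (12.42, 13.99),
    "SFFT": (5.17, 4.11),
    "SNBODY": (1.58, 2.64),
    "DGEMM": (6.09, 9.61),
    "DFFT": (2.83, 3.66),
    "DNBODY": (1.76, 3.71),
}
FP32_TESTS = ("SGEMM", "SFFT", "SNBODY")
FP64_TESTS = ("DGEMM", "DFFT", "DNBODY")


def geomean_ratio(tests) -> float:
    """Geometric mean of Core M / CherryTrail score ratios over the given tests."""
    ratios = [science[t][1] / science[t][0] for t in tests]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


fp32_ratio = geomean_ratio(FP32_TESTS)
fp64_ratio = geomean_ratio(FP64_TESTS)

print(f"Core M vs. CherryTrail, FP32: {fp32_ratio:.2f}x")
print(f"Core M vs. CherryTrail, FP64: {fp64_ratio:.2f}x")
```

On these rows the FP64 mean comes out noticeably higher than the FP32 mean, which is consistent with the observation above.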

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 8.x/10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64 SP1, latest Intel drivers. .Net 4.5.x, Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks Atom Z3770 (BayTrail) Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) Comments
Intel Braswell: .Net Arithmetic
.Net Arithmetic Benchmark .Net Dhrystone (GIPS) 2.69 3.85 [+43%] 2.95 [-33%] .Net CLR performance improves by 43% – a great start and in line with what we saw with native code – again faster than Core M.
.Net Arithmetic Benchmark .Net Whetstone float/FP32 (GFLOPS) 2.91 9 [+3.1x] 10.8 [+20%] Floating-point CLR performance improves by a whopping 3x (three times)! While apps should really do compute tasks in native code, this ensures that .Net (or Java) apps will fly. Core M is just 20% faster still.
.Net Arithmetic Benchmark .Net Whetstone double/FP64 (GFLOPS) 5.89 10 [+69%] 12.71 [+27%] FP64 CLR performance improves by a lower 70% but still great improvement. Core M’s performance improves almost 30%.
Just like native SIMD code, we see great improvement of 40-200% making CherryTrail significantly faster running .Net apps than old Atom. Core M is only 20-30% faster still.
Intel Braswell: .Net Vectorised
.Net Vectorised Benchmark .Net Integer Vectorised/Multi-Media (MPix/s) 6.42 11.34 [+76%] 11.25 [=] A good start, we see vectorised .Net code a huge 76% faster in CherryTrail – even faster than Core M!
.Net Vectorised Benchmark .Net Long Vectorised/Multi-Media (MPix/s) 0.761 6.36 [+8.3x] 5.62 [-12%] With 64-bit integer vectorised workload, we see a pretty unbelievable 8x improvement – the old Atom really has an issue with this test. As before, CherryTrail is faster than Core M.
.Net Vectorised Benchmark .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 0.922 2.6 [+2.82x] 4.14 [+59%] Switching to single-precision (FP32) floating-point code, CherryTrail is almost 3x faster. It seems that whatever workload you have in .Net, new Atom is a good deal faster. Core M is 60% faster still.
.Net Vectorised Benchmark .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 2.52 6.25 [+2.48x] 9.36 [+49%] Switching to FP64 code, CherryTrail’s improvement drops to 2.5x – but Core M is 50% faster still! While it is unlikely that compute tasks are written in .Net rather than native code, small compute code does benefit.
Vectorised .Net improves even better, between 80-200% whether integer or floating-point workload. Unlike what we saw when we reviewed the Core CPUs, Atom does improve from arch to arch.

Just like native code, we see a huge 40-200% improvement when running .Net or Java code. Any “modern” apps (either Universal, Metro, WPF) should run very much faster – especially as heavily optimised SIMD code has also been shown to be a lot faster.

Final Thoughts / Conclusions

Despite being just a process shrink with minor improvements, CherryTrail is very much faster (25-100%) than the old Atom (BayTrail) within the same power envelope (TDP). Let’s not forget BayTrail itself improved greatly over the previous Atoms – thus new Atom is light-years away in performance. And that’s before we mention the very much improved EV8 GPGPU whose performance we saw earlier, see CPU Atom Z8700 (CherryTrail) GPGPU performance.

Any major CPU improvements including instruction set support (AVX2, FMA, SHA HWA) will have to wait for the next Atom arch – which should make it a formidable APU – but what we have here is still significant.

Coupled with its much improved GPGPU performance, we can now see just why Microsoft has selected it for Surface 3 – it really puts the latest Core M to shame – considering its 2x higher power rating as well as its much higher cost.

If you are after a new tablet, HTPC (like the NUC) or compute stick – best to wait for the new systems with CherryTrail – they are worth it! No ifs and no buts – these are the Atom APUs you are looking for.