Intel Atom X7 (CherryTrail 2015) CPU: Closing on Core M?

What is CherryTrail (Braswell)?

“CherryTrail” (CYT) is the next generation Atom “APU” SoC from Intel (v3 2015) replacing the current Z3000 “BayTrail” (BYT) SoC which was Intel’s major foray into tablets (both Windows & Android). The “desktop” APUs are known as “Braswell” (BRS) while the APUs for other platforms have different code names.

BayTrail was a major update to both the CPU (out-of-order core, SSE4.x, AES HWA, Turbo/dynamic overclocking) and the GPU (EV7 IvyBridge GPGPU core), so CherryTrail is a minor process shrink – but with a very much updated GPGPU, now at EV8 (as in the latest Core Broadwell).

In this article we test CPU performance; the (GP)GPU graphics unit is covered in the companion GPGPU article.

Hardware Specifications

We are comparing Atom processors with the Core M processors. New models may run at higher (or lower) frequencies than the ones they replace, thus performance delta can vary.

APU Specifications | Atom Z3770 (BayTrail) | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | Comments
Cores (CU) / Threads (SP) | 4C / 4T | 4C / 4T | 2C / 4T | The new Atom still has 4C and no HT, same as the old Atom; Core M has 2 cores but with HT, so still 4 threads in total – we shall see whether this makes a big difference.
Speed (Min / Max / Turbo) | 533-1467-2400 (4x-11x-18x) | 480-1600-2400 (6x-20x-30x) | 500-800-2000 (5x-8x-20x) | The new CherryTrail Atom has a lower BCLK (80MHz) compared to BayTrail (133MHz) and thus higher multipliers, while LFM (Low), MFM (Rated) and Turbo (TFM) speeds are very similar.
Power (TDP) | 2.4W | 2.4W | 4.5W | TDP remains the same for the new Atom, while Core M needs around 2x (twice) as much power.
L1D / L1I Caches | 4x 24kB 6-way / 4x 32kB 8-way | 4x 24kB 6-way / 4x 32kB 8-way | 2x 32kB 8-way / 2x 32kB 8-way | No change in L1 caches on the new Atom; comparatively Core M has half the caches.
L2 Caches | 2x 1MB 16-way | 2x 1MB 16-way | 4MB 16-way | No change in L2 cache; here Core M has twice as much cache – same size as a normal i7 ULV.
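
The relationship between base clock (BCLK), multipliers and the quoted speeds can be checked with a few lines of Python. This is only an illustrative sketch: the 80MHz BCLK for CherryTrail comes from the table comments, the 100MHz BCLK for Core M is an assumption inferred from its 5x multiplier at 500MHz, and the function name `clocks_mhz` is ours.

```python
def clocks_mhz(bclk_mhz, multipliers):
    """Return the core clock in MHz for each multiplier at a given base clock."""
    return [round(bclk_mhz * m) for m in multipliers]

# CherryTrail: 80MHz BCLK (from the table comments), multipliers 6x / 20x / 30x.
cherrytrail = clocks_mhz(80, (6, 20, 30))

# Core M 5Y10: the 100MHz BCLK is an assumption inferred from 500MHz at 5x.
core_m = clocks_mhz(100, (5, 8, 20))

print("CherryTrail LFM / MFM / TFM (MHz):", cherrytrail)
print("Core M      LFM / MFM / TFM (MHz):", core_m)
```

A lower BCLK with proportionally higher multipliers lands on much the same clocks, which is why the rated and Turbo speeds barely move between the two Atoms.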

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Haswell introduced AVX2, which allows 256-bit integer SIMD (AVX only allowed 128-bit), and FMA3 – finally bringing “fused-multiply-add” to Intel CPUs. We are now seeing CPUs getting as wide as (GP)GPUs – not far from AVX3 512-bit in “Phi”.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
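
The bracketed labels in the results tables compare each column with the one to its left: CherryTrail against BayTrail, then Core M against CherryTrail. A minimal sketch of how such a label could be derived follows; the helper name `delta_label` and the rounding rule are our assumptions, so a published label may differ by a point.

```python
def delta_label(value, baseline):
    """Format `value` relative to `baseline` in the style of the bracketed labels."""
    ratio = value / baseline
    if ratio >= 2.0:
        # Large gains are shown as a multiplier, e.g. "+3.1x".
        return f"+{round(ratio, 1):g}x"
    # Smaller changes are shown as a signed percentage, e.g. "+32%".
    return f"{(ratio - 1.0) * 100.0:+.0f}%"

# Scores taken from the tables in this article (higher is better).
print("AES-256, CherryTrail vs BayTrail:", delta_label(1.44, 1.09))
print("Binomial FP64, CherryTrail vs BayTrail:", delta_label(4.13, 1.33))
```

This only applies to "higher is better" scores; for latency rows a negative percentage is the improvement.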

Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Atom Z3770 (BayTrail) Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) Comments
          Intel Braswell: CPU Arithmetic
CPU Arithmetic Benchmark Native Dhrystone (GIPS) 15.11 SSE4 35.91 SSE4 [+55%] 31.84 AVX2 [-12%] CherryTrail has no AVX2 but manages to be 55% faster than old Atom and even a bit faster than Core M with AVX2! A great start!
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 11.09 AVX 18.43 AVX [+8.4%] 21.56 AVX/FMA [+17%] CherryTrail has no FMA either, but here it is only 8% faster than old Atom. Core M does better here – it’s 16% faster still.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 8.6 AVX 12.34 AVX [+2.8%] 13.49 AVX/FMA [+17%] With FP64 we see an even smaller 3% gain. Core M remains 16% faster still.
We see a big improvement with integer workload of 50% but minor 3-8% with floating-point. Sure Core M is faster (16%) but not by a huge amount to justify power and cost increase.
Intel Braswell: CPU Vectorised SIMD
CPU Multi-Media Vectorised SIMD Native Integer (Int32) Multi-Media (Mpix/s) 25.1 AVX 48.7 AVX [+25%] 70.8 AVX2 [+45%] Again CherryTrail has to do without AVX2, but is still 25% faster than old Atom – a decent improvement. Here though AVX2 of Core M has a sizeable 45% improvement.
CPU Multi-Media Vectorised SIMD Native Long (Int64) Multi-Media (Mpix/s) 9 AVX 14.5 AVX [+4.6%] 24.5 AVX2 [+69%] With a 64-bit integer workload, the improvement drops to about 5%. Here Core M with AVX2 is 70% faster still – finally a big improvement over Atom.
CPU Multi-Media Vectorised SIMD Native Quad-Int (Int128) Multi-Media (Mpix/s) 0.109 0.512 [+3.3x] 0.382 [-25%] This is a tough test using Long integers to emulate Int128, but here CherryTrail manages to be over 3x faster than old Atom – and even faster than Core M! Without SIMD the new Atom does much better than even Core M.
CPU Multi-Media Vectorised SIMD Native Float/FP32 Multi-Media (Mpix/s) 18.9 AVX 41.5 AVX [+42%] 61.3 FMA [+47%] In this floating-point AVX/FMA algorithm, CherryTrail does much better returning with a 42% improvement over old Atom. Core M also returns to a 47% improvement still – seems to be the par for SIMD ops.
CPU Multi-Media Vectorised SIMD Native Double/FP64 Multi-Media (Mpix/s) 8.18 AVX 15.9 AVX [+93%] 36.48 FMA [+2.3x] Switching to FP64 code, CherryTrail is now almost 2x as fast as old Atom – with Core M being over 2x faster still.
CPU Multi-Media Vectorised SIMD Native Quad-Float/FP128 Multi-Media (Mpix/s) 0.5 AVX 0.81 AVX [+62%] 1.69 FMA [+2.1x] In this heavy algorithm using FP64 to mantissa extend FP128, CherryTrail manages a 62% improvement and again Core M is over 2x still.
Lack of AVX2/FMA and higher Turbo speed prevents the new Atom from crushing the old Atom – but still allows a 25-100% (2x as fast) improvement, a significant improvement. Here, though, with AVX2 and FMA – Core M is 45-100% faster still thus could be worth it if more compute power is required.
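
The Quad-Float test above extends the mantissa by carrying a number as two FP64 values. The benchmark's own code is not shown in this article, so the following is only a sketch of the general "double-double" technique (Knuth's error-free two-sum plus a simplified addition); the function names are ours. Note that double-double gives roughly 106 bits of mantissa, which is close to but not the same as IEEE FP128.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, err) where s + err equals a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def dd_add(x, y):
    """Add two (hi, lo) double-double values (simplified, not fully accurate variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    # Renormalise so that hi holds the leading bits and lo the remainder.
    hi = s + e
    lo = e - (hi - s)
    return hi, lo

one = (1.0, 0.0)
tiny = (1e-20, 0.0)

# Plain FP64 loses the small term entirely; the (hi, lo) pair keeps it.
print("plain FP64 drops the small term:", 1.0 + 1e-20 == 1.0)
hi, lo = dd_add(one, tiny)
print("double-double keeps it:", hi, lo)
```

Each extended addition costs several dependent FP64 operations, which goes some way to explaining why these tests score so much lower than the native FP64 ones.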
Intel Braswell: CPU Crypto
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 1.09 AES HWA 1.44 AES HWA [+32%] 2.59 AES HWA [+79%] All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – here CherryTrail is 32% faster – thus we’d predict its memory controller can yield +35% more bandwidth. But even with just half no. cores Core M is 80% faster still – likely due to its dual-channel controller.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.51 AES HWA 2 AES HWA [+32%] ? AES HWA What we saw with AES-256 was no fluke: less rounds don’t make any difference, new Atom is 32% faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 0.143 AVX 0.572 AVX [+4x] 0.93 AVX2 [+63%] SHA HWA will come for the next Atom arch, so neither CPU supports it. But CherryTrail can manage to be a whopping 4x (four times) faster than old Atom. But Core M with AVX2 still manages to be 63% faster.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 0.388 AVX 1.17 AVX [+3x] ? AVX2 With a less complex algorithm – we see a 3x (three times) improvement over old Atom.
CherryTrail still misses AVX2 or forthcoming SHA HWA – but due to (expected) higher memory bandwidth it still improves over old Atom – while raw compute is 3-4x faster – a huge improvement. Due to its dual-channel controller and AVX2 Core M is still faster than it by a good 60-80%.
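
For readers who want a rough feel for hashing throughput on their own machine, a quick single-threaded measurement can be sketched as below. It uses Python's standard `hashlib`, not the AVX code paths benchmarked here, and the buffer size and round count are arbitrary assumptions, so the figure is not comparable with the table.

```python
import hashlib
import time

def sha256_throughput_gbs(buffer_mb=4, rounds=4):
    """Hash a zero-filled buffer repeatedly and return the throughput in GB/s."""
    data = bytes(buffer_mb * 1024 * 1024)
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    return (len(data) * rounds) / elapsed / 1e9

gbs = sha256_throughput_gbs()
print(f"SHA2-256: {gbs:.3f} GB/s (single thread, hashlib)")
```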
Intel Braswell: CPU Finance
CPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 8.0 14.15 [+76%] 21.88 [+54%] In this non-SIMD test we start with a good 76% improvement over old Atom – while Core M is 50% faster still. CherryTrail shows its prowess in FPU processing.
CPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7.3 11.66 [+59%] 17.6 [+50%] Switching to FP64 code, CherryTrail is still 60% faster – and Core M remains 50% faster still.
CPU Finance Benchmark Binomial float/FP32 (kOPT/s) 1.76 3.85 [+2.2x] 5.1 [+32%] Binomial uses thread shared data thus stresses the cache & memory system; CherryTrail is over 2x (twice) as fast as old BayTrail – with Core M just 32% faster. It seems the new memory improvements do help a lot.
CPU Finance Benchmark Binomial double/FP64 (kOPT/s) 1.33 4.13 [+3.1x] 5.02 [+21%] With FP64 code CherryTrail is now a whopping 3x (three times) faster – a massive improvement. Core M is just 20% faster still.
CPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1.37 2.67 [+94%] 3.73 [+39%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; CherryTrail remains about 2x faster than old Atom – showing just how much the memory improvements help. Core M is still 40% faster.
CPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1.54 2.43 [+57%] 3.0 [+23%] Switching to FP64 the improvement drops to around 60% – still a big improvement. Core M’s improvement drops to 23%.
The financial tests show big improvements of 60-200% – complex compute tasks are now very much possible on Atom. Without help from AVX2 and FMA – Core M can only be 20-50% faster still, not the improvement we were hoping.
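
As a reminder of what the Black-Scholes test computes per option, here is a minimal scalar sketch of the standard closed-form price of a European call. It is not the benchmark's code; the parameter names and the sample inputs are ours.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, years):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)

# Sample inputs (assumed for illustration): at-the-money, 5% rate, 20% volatility, 1 year.
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
print(f"call price: {price:.4f}")
```

Each option needs only a handful of transcendental operations and no shared data, which is why this test scales with raw FPU throughput while Binomial and Monte-Carlo lean on the cache and memory system.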
Intel Braswell: CPU Science
CPU Science Benchmark SGEMM (GFLOPS) float/FP32 6.19 AVX 12.42 AVX [+2x] 13.99 FMA [+12%] In this tough SIMD algorithm, again CherryTrail manages to be 2x as fast as old Atom; even with the help of FMA, Core M is only 12% faster. The new Atom is just running away with it.
CPU Science Benchmark DGEMM (GFLOPS) double/FP64 3.38 AVX 6.09 AVX [+80%] 9.61 FMA [+58%] With FP64 SIMD code, CherryTrail remains about 80% faster, just under 2x. Here Core M shows its power, it’s almost 60% faster still.
CPU Science Benchmark SFFT (GFLOPS) float/FP32 2.83 AVX 5.17 AVX [+82%] 4.11 FMA [-21%] FFT also uses SIMD and thus AVX but stresses the memory sub-system more: CherryTrail remains 82% faster than old Atom and here it beats even Core M.
CPU Science Benchmark DFFT (GFLOPS) double/FP64 2 AVX 2.83 AVX [+41%] 3.66 FMA [+29%] With FP64 code, the improvement is reduced to just 41% but it is still significant. Core M’s improvement drops to just 30%.
CPU Science Benchmark SNBODY (GFLOPS) float/FP32 0.929 AVX 1.58 AVX [+70%] 2.64 FMA [+67%] N-Body simulation is SIMD heavy but many memory accesses to shared data but CherryTrail still manages a 70% improvement, with Core M 70% faster still.
CPU Science Benchmark DNBODY (GFLOPS) double/FP64 1.25 AVX 1.76 AVX [+40%] 3.71 FMA [+2.1x] Unlike what we saw before, with FP64 code the improvement drops to 40% as with DFFT. Core M is over 2x faster still.
With highly optimised SIMD AVX code, we see a similar 40-100% improvement in performance; as we said before complex algorithms run a lot faster on the new Atom. With FP64 code, Core M’s FPU shows its power but at what cost?
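
GFLOPS figures for GEMM are conventionally derived by counting 2·M·N·K floating-point operations for an M×K by K×N product. The sketch below shows that arithmetic with a deliberately naive pure-Python multiply; the matrix size is an arbitrary assumption (the benchmark's sizes are not stated here) and the resulting figure will be orders of magnitude below the AVX results in the table.

```python
import time

def gemm_gflops(m, n, k, seconds):
    """GFLOPS for C[m x n] = A[m x k] * B[k x n], counting 2*m*n*k operations."""
    return 2.0 * m * n * k / seconds / 1e9

def naive_gemm(a, b):
    """Unoptimised triple-loop matrix multiply on lists of lists."""
    n_rows, inner, n_cols = len(a), len(b), len(b[0])
    c = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(inner):
            aip = a[i][p]
            for j in range(n_cols):
                c[i][j] += aip * b[p][j]
    return c

size = 64  # arbitrary small size so the example runs quickly
a = [[1.0] * size for _ in range(size)]
b = [[2.0] * size for _ in range(size)]

start = time.perf_counter()
c = naive_gemm(a, b)
elapsed = time.perf_counter() - start
print(f"{gemm_gflops(size, size, size, elapsed):.4f} GFLOPS (pure Python, unoptimised)")
```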
Intel Braswell: CPU Multi-Core Transfer
CPU Multi-Core Benchmark Inter-Core Bandwidth (GB/s) 1.63 1.69 [+5%] 8.5 [+5x] With unchanged L1/L2 caches and similar rated/Turbo clocks – it is not surprising the new Atom does not improve on inter-core bandwidth. Core M shows its prowess – managing over 5x higher bandwidth. We see how all these caches perform in the Cache & Memory Atom (CherryTrail) performance article.
CPU Multi-Core Benchmark Inter-Core Latency (ns) 182 179 [-6%] 76 [-68%] Latency, however, sees a 5% decrease, likely due to the higher clock of CherryTrail during run (due to higher rated speed) since the caches are unchanged. Core M’s latencies seem to be 1/2 (half) of Atom.

While it does not bring any new instruction sets, CherryTrail improves significantly over the old Atom (30-100%) – much more than we’ve seen in Core series from arch to arch. It does not support new instruction sets (e.g. AVX2, FMA, SHA HWA) nor new caches – which is perhaps just as well, as Core M is not much faster.

Considering just how much BayTrail improved over the old Atom arch, the improvements are particularly impressive.

About the only “issue” is FP64 performance where Core M shows its power – whether using SIMD AVX/FMA or FPU code.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 8.x/10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64 SP1, latest Intel drivers. .Net 4.5.x, Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks Atom Z3770 (BayTrail) Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) Comments
Intel Braswell: .Net Arithmetic
.Net Arithmetic Benchmark .Net Dhrystone (GIPS) 2.69 3.85 [+43%] 2.95 [-33%] .Net CLR performance improves by 43% – a great start and in line to what we saw with native code – again faster than Core M.
.Net Arithmetic Benchmark .Net Whetstone float/FP32 (GFLOPS) 2.91 9 [+3.1x] 10.8 [+20%] Floating-Point CLR performance improves by a whopping 3x (three times)! While apps should really do compute tasks in native code, this ensures that .Net (or Java) apps will fly. Core M is just 20% faster still.
.Net Arithmetic Benchmark .Net Whetstone double/FP64 (GFLOPS) 5.89 10 [+69%] 12.71 [+27%] FP64 CLR performance improves by a lower 70% but still great improvement. Core M’s performance improves almost 30%.
Just like native SIMD code, we see great improvement of 40-200% making CherryTrail significantly faster running .Net apps than old Atom. Core M is only 20-30% faster still.
Intel Braswell: .Net Vectorised
.Net Vectorised Benchmark .Net Integer Vectorised/Multi-Media (MPix/s) 6.42 11.34 [+76%] 11.25 [=] A good start, we see vectorised .Net code a huge 76% faster in CherryTrail – even faster than Core M!
.Net Vectorised Benchmark .Net Long Vectorised/Multi-Media (MPix/s) 0.761 6.36 [+8.3x] 5.62 [-12%] With 64-bit integer vectorised workload, we see a pretty unbelievable 8x improvement – the old Atom really has an issue with this test. As before, CherryTrail is faster than Core M.
.Net Vectorised Benchmark .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 0.922 2.6 [+2.82x] 4.14 [+59%] Switching to single-precision (FP32) floating-point code, CherryTrail is almost 3x faster. It seems that whatever workload you have in .Net, new Atom is a good deal faster. Core M is 60% faster still.
.Net Vectorised Benchmark .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 2.52 6.25 [+2.48x] 9.36 [+49%] Switching to FP64 code, CherryTrail’s improvement drops to 2.5x – but Core M is 50% faster still! While compute tasks are unlikely to be written in .Net rather than native code, small compute code does benefit.
Vectorised .Net improves even better, between 80-200% whether integer or floating-point workload. Unlike what we saw when we reviewed the Core CPUs, Atom does improve from arch to arch.

Just like native code, we see a huge 40-200% improvement when running .Net or Java code. Any “modern” apps (either Universal, Metro, WPF) should run very much faster – especially as heavily optimised SIMD code has also been shown to be a lot faster.

Final Thoughts / Conclusions

Despite being just a process shrink with minor improvements, CherryTrail is very much faster (25-100%) than the old Atom (BayTrail) within the same power envelope (TDP). Let’s not forget BayTrail itself improved greatly over the previous Atoms – thus new Atom is light-years away in performance. And that’s before we mention the very much improved EV8 GPGPU whose performance we saw earlier, see CPU Atom Z8700 (CherryTrail) GPGPU performance.

Any major CPU improvements including instruction set support (AVX2, FMA, SHA HWA) will have to wait for the next Atom arch – which should make it a formidable APU – but what we have here is still significant.

Coupled with its much improved GPGPU performance, we can now see just why Microsoft has selected it for Surface 3 – it really puts the latest Core M to shame – considering its 2x higher power rating as well as its much higher cost.

If you are after a new tablet, HTPC (like the NUC) or compute stick – best to wait for the new systems with CherryTrail – they are worth it! No ifs and no buts – these are the Atom APUs you are looking for.

Intel Atom X7 (CherryTrail 2015) GPGPU: Closing on Core M?

What is CherryTrail (Braswell)?

“CherryTrail” (CYT) is the next generation Atom “APU” SoC from Intel (v3 2015) replacing the current Z3000 “BayTrail” (BYT) SoC which was Intel’s major foray into tablets (both Windows & Android). The “desktop” APUs are known as “Braswell” (BRS) while the APUs for other platforms have different code names.

BayTrail was a major update to both the CPU (out-of-order core, SSE4.x, AES HWA, Turbo/dynamic overclocking) and the GPU (EV7 IvyBridge GPGPU core), so CherryTrail is a minor process shrink – but with a very much updated GPGPU, now at EV8 (as in the latest Core Broadwell).

In this article we test (GP)GPU graphics unit performance; CPU performance is covered in the companion CPU article.

Hardware Specifications

We are comparing the internal GPUs of 3 processors (BayTrail, CherryTrail and Broadwell-Y) that support GPGPU.

Graphics Unit BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comment
Graphics Core B-GT EV7 B-GT EV8? B-GT2Y EV8 CherryTrail’s GPU is meant to be based on EV8 like Broadwell – the very latest GPGPU core from Intel! This makes it more advanced than the very popular Core Haswell series, a first for Atom.
APU / Processor Atom Z3770 Atom X7 Z8700 Core M 5Y10 Core M is the new Core-Y UULV versions for high-end tablets against the Atom processors for “normal” tablets/phones.
Cores (CU) / Shaders (SP) / Type 4C / 32SP (2×4 SIMD) 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) Here’s the major change: CherryTrail has no less than 4 times the compute units (CU) in the same power envelope of the old BayTrail. Broadwell has more (24) but it is also rated at higher TDP.
Speed (Min / Max / Turbo) MHz 333 – 667 200 – 600 200 – 800 CherryTrail goes down all the way to 200MHz (same as Broadwell) which should help power savings. Its top speed is a bit lower than BayTrail but not by much.
Power (TDP) W 2.4 (under 4) 2.4 (under 4) 4.5 Both Atoms have the same TDP of around 2-2.4W – while Broadwell-Y is rated at 2x at 4.5-6W. We shall see whether this makes a difference.
DirectX / OpenGL / OpenCL Support 11 / 4.0 / 1.1 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 Intel has continued to improve the video driver – 2 generations share a driver – but here CherryTrail has a brand-new driver that supports much newer technologies like DirectX 11.1 (vs 11.0), OpenGL 4.3 (vs 4.0) including Compute and OpenCL 1.2. Broadwell’s driver does support OpenCL 2.0 – perhaps a later CherryTrail driver will do too?
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX) No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) Sadly FP16 support is still missing and FP64 is also missing on OpenCL – but available in DirectX Compute as well as OpenGL Compute! Those Intel FP64 extensions are taking their time to appear…
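
The shader counts and clocks in the table allow a back-of-the-envelope estimate of theoretical FP32 throughput. This is a sketch under stated assumptions: 8 SPs per CU follows from the "2×4 SIMD" entries above, while one fused multiply-add (2 FLOPs) per SP per clock is our assumption and not a figure from this article; the function names are ours.

```python
SP_PER_CU = 8  # 2x4 SIMD per compute unit, as listed in the table

def shader_count(compute_units):
    """Number of shaders (SP) for a given number of compute units (CU)."""
    return compute_units * SP_PER_CU

def peak_fp32_gflops(compute_units, clock_mhz, flops_per_sp_per_clock=2):
    """Theoretical FP32 rate; assumes one multiply-add per SP per clock."""
    return shader_count(compute_units) * flops_per_sp_per_clock * clock_mhz / 1000.0

# (name, CUs, top clock in MHz) taken from the specification table.
for name, cus, mhz in (
    ("BayTrail GT", 4, 667),
    ("CherryTrail GT", 16, 600),
    ("Broadwell GT2Y", 24, 800),
):
    print(f"{name}: {shader_count(cus)} SP, ~{peak_fp32_gflops(cus, mhz):.1f} GFLOPS peak")
```

On paper the 4x increase in CUs outweighs the slightly lower top clock by a wide margin, which sets the expectation for the GPGPU results that follow.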

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: GPGPU Vectorised
GPGPU Arithmetic Benchmark Single/Float/FP32 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 181 [+13%] Straight off the bat we see that 4x as many, more advanced CUs in CherryTrail give us 6.4x better performance, a huge improvement! Even the brand-new Broadwell GPU is only 13% faster.
GPGPU Arithmetic Benchmark Half/Float/FP16 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 180 [+13%] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
GPGPU Arithmetic Benchmark Double/FP64 Vectorised OpenCL (Mpix/s) 1.63 (emulated) 10.1 [+6.2x] (emulated) 11.6 [+15%] (emulated) None of the GPUs support native FP64 either: emulating FP64 (mantissa extending) is quite hard on all GPUs, but the results don’t change: CherryTrail is 6.2x faster with Broadwell just 15% faster.
GPGPU Arithmetic Benchmark Quad/FP128 Vectorised OpenCL (Mpix/s) 0.18 (emulated) 1.08 [+6x] (emulated) 1.32 [+22%] (emulated) Emulating FP128 using FP32 is even more complex but CherryTrail does not disappoint, it is still 6x faster; Broadwell does pull ahead a bit being 22% faster.
Intel Braswell: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 96 825 [+8.6x] 770 [-7%] In this tough integer workload that uses shared memory CherryTrail does even better – it is 8.6x faster, more than we’d expect – the newer driver may help. Surprisingly this is faster than even Broadwell.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 129 1105 [+8.6x] n/a What we saw before is no fluke, CherryTrail’s GPU is still 8.6 times faster than BayTrail’s.
Intel Braswell: GPGPU Hashing
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 54 309 [+5.7x] This 64-bit integer compute-heavy workload is hard on all GPUs (no native 64-bit arithmetic), but CherryTrail does well – it is almost 6x faster than the older BayTrail. Note that neither DirectX nor OpenGL natively support int64 so this is about as hard as it gets for our GPUs.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 96 1187 [+12.4x] 1331 [+12%] In this integer compute-heavy workload, CherryTrail really shines – it is 12.4x (twelve times) faster than BayTrail! Again, even the latest Broadwell is just 12% faster than it! Atom finally kicks ass both in CPU and GPU performance.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 215 2764 [+12.8x] SHA1 is less compute-heavy, but results don’t change: CherryTrail is 12.8x faster – the best result we’ve seen so far.
Intel Braswell: GPGPU Finance
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 34.33 299.8 [+8.7x] 280.5 [-7%] Starting with the financial tests, CherryTrail is quick off the mark – being almost 9x (nine times) faster than BayTrail – and again somehow faster than even Broadwell. Who says Atom cannot hold its own now?
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 5.16 28 [+5.4x] 36.5 [+30%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – but CherryTrail still holds its own, it’s over 5x faster – not as much as we saw before but still a massive improvement. Broadwell’s EV8 GPU does show its prowess being 30% faster still.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 3.6 61.9 [+17x] 54 [-12%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; here we see CherryTrail shine – it’s 17x (seventeen times) faster than BayTrail’s GPU – so much so we had to recheck. Most likely the newer GPU driver helps – but BayTrail will not get these improvements. Broadwell is again surprisingly 12% slower.
Intel Braswell: GPGPU Science
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 6 45 [+7.5x] 44.1 [-3%] GEMM is quite a tough algorithm for our GPUs but CherryTrail remains over 7.5x faster – even Broadwell is 3% slower than it. We saw before EV8 not performing as we expected – perhaps some more optimisations are needed.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 2.27 9 [+3.96x] 8.94 [-1%] FFT involves many kernels processing data in a pipeline – so here we see CherryTrail only 4x (four times) faster – the slowest we’ve seen so far. But then again Broadwell scores about the same so it’s a tough test for all GPUs.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 10.74 65 [+6x] 50 [-23%] In our last test we see CherryTrail going back to being 6x faster than BayTrail – surprisingly again Broadwell’s EV8 GPU is 23% slower than it.

There is no simpler way to put this: CherryTrail Atom’s GPU obliterates the old one – never being less than 4x and up to 17x (yes, seventeen!) faster, many times even overtaking the much newer, more expensive and more power hungry Broadwell (Core M) EV8 GPU! It is a no-brainer really, you want it – for once Microsoft made a good choice for Surface 3 after the disasters of earlier Surfaces (perhaps they finally learn? Nah!).
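
The "never less than 4x and up to 17x" range can be reproduced directly from the OpenCL scores in the table above. The sketch below recomputes the speed-ups and, as an addition of ours that the article does not quote, a geometric mean; the FP16 row is left out because it simply repeats the FP32 result.

```python
from math import prod

# (test, BayTrail score, CherryTrail score), OpenCL results from the table above.
RESULTS = [
    ("FP32 vectorised", 25, 160),
    ("FP64 vectorised (emulated)", 1.63, 10.1),
    ("FP128 vectorised (emulated)", 0.18, 1.08),
    ("AES256", 96, 825),
    ("AES128", 129, 1105),
    ("SHA2-512", 54, 309),
    ("SHA2-256", 96, 1187),
    ("SHA1", 215, 2764),
    ("Black-Scholes", 34.33, 299.8),
    ("Binomial", 5.16, 28),
    ("Monte-Carlo", 3.6, 61.9),
    ("SGEMM", 6, 45),
    ("SFFT", 2.27, 9),
    ("N-Body", 10.74, 65),
]

speedups = {name: new / old for name, old, new in RESULTS}
slowest = min(speedups, key=speedups.get)
fastest = max(speedups, key=speedups.get)
geomean = prod(speedups.values()) ** (1.0 / len(speedups))

print(f"smallest gain: {slowest} at {speedups[slowest]:.2f}x")
print(f"largest gain:  {fastest} at {speedups[fastest]:.2f}x")
print(f"geometric mean over {len(speedups)} tests: {geomean:.2f}x")
```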

There isn’t really much to criticise: sure, FP16 native support is missing – which is a pity on Android (that uses FP16 in UX) – and naturally FP64 is also missing in OpenCL, though as usual it is available in DirectX Compute and OpenGL Compute. As mentioned, since OpenGL 4.3 is supported, Compute is also supported for the first time on Atom – a feature recently introduced in newer drivers on Haswell and later GPUs (EV7.5, EV8).

Just in case we’re not clear: this *is* the Atom you are looking for!

Transcoding Performance

We are testing transcoding performance of the GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) Same transcoder is used for all GPUs.
Intel Braswell: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 2.24 5 [+2.23x] 8.41 [+68%] H.264 transcoding on the new Atom has more than doubled (2.2x) which makes it ideal as a HTPC (e.g. Plex server). However, with more power we can see that Core M has over 60% more bandwidth.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 1.89 4.75 [+2.51x] 8.2 [+70%] When just using the H264 encoder we still see a 2.5x improvement (over two and a half times), with Core M again about 70% faster still.

Intel has not forgotten transcoding, with the new Atom over 2x (twice) as fast – so if you were thinking of using it as a HTPC (NUC/Brix) server, better get the new one. However, unless you really want low power – the Core M (and thus ULV) versions are 60-70% faster still…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Memory Configuration 2GB DDR3 1067MHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) Atom is generally configured to use a single memory controller, but CherryTrail runs at 1600MT/s, same as modern Core APUs. Core M/Broadwell naturally has a dual-channel controller, though some laptops/tablets may use just one channel.
Cache Configuration 32kB L2 global/texture? 128kB L3 256kB L2 global/texture? 384kB L3 256kB L2 global/texture? 384kB L3 Internal cache arrangement seems to be very secret – so a lot of it is deduced from the latency graphs. The L2 increase in CherryTrail is in line with CU increase, i.e. 8x larger.
Intel Braswell: GPGPU Memory Bandwidth
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 3.45 11 [+3.2x] 10.1 [-8%] CherryTrail manages over 3x higher bandwidth during internal transfer over BayTrail, close to what we’d expect a dual-channel system to achieve. Surprisingly our dual-channel Core M manages to be 8% slower. We did see Broadwell achieve less than Haswell – which may explain what we’re seeing here.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 1.18 2.09 [+77%] 3.91 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Here CherryTrail improves almost 2x over BayTrail – but finally we see Core M being 87% faster still.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 1.13 2.29 [+2.02x] 3.79 [+65%] While upload bandwidth improved by 77%, download bandwidth has improved a bit more, with CherryTrail being over 2x (twice) faster – but again Broadwell is 65% faster still. This will really help GPGPU applications that need to copy large results from the GPU to CPU memory until the “zero copy” feature arrives.
Intel Braswell: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 981 829 [-15%] 274 With the memory running faster, we see latency decreasing by 15% a good result. However, Broadwell does so much better with almost 1/4 latency.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1272 1669 [+31%] Surprisingly, using full-random access we see latency increase by 31%. This could be due to the larger (4GB vs. 2GB) memory arrangement – the TLB-miss hit could be much higher.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 383 279 [-27%] Sequential access brings the latency down by 27% – a good result.
GPGPU Memory Latency Constant Memory Latency (ns) 660 209 [-1/3x] With L1 cache covering the entire constant memory on CherryTrail – we see latency decrease to 1/3 (a third), great for kernels that use more than 32kB constant data.
GPGPU Memory Latency Shared Memory Latency (ns) 215 201 [-6%] Shared memory is a bit faster (6% lower latency), nothing to write home about.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1583 1234 [-22%] With the memory running faster, as with global memory we see latency decreasing by 22% here – a good result!
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 916 353 [-61%] Sequential access brings the latency down by a huge 61%, an even bigger difference than what we saw with Global memory. Impressive!

Again, we see big gains in CherryTrail with bandwidth increasing by 2-3x which is necessary to keep all those new EVs fed with data; Broadwell does do better but then again it has a dual-channel memory controller.

Latency has also decreased by a good amount (6-22%), likely due to the faster memory employed, and the much larger caches (8x) do help. For data that exceeded the small BayTrail cache (32kB) – the CherryTrail one should be more than sufficient.
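
To put the measured bandwidth in context, the theoretical peak of a DDR3 interface is simply bus width times transfer rate. The sketch below assumes DDR3-1600 means 1600MT/s and uses decimal GB; which bus width each test system actually runs ("64/128-bit" in the table) is not confirmed here, so both are shown. The function name is ours.

```python
def peak_bandwidth_gbs(bus_width_bits, transfers_mt_s):
    """Theoretical peak bandwidth in GB/s (decimal) for a given bus width and rate."""
    return bus_width_bits / 8 * transfers_mt_s / 1000.0

single_1600 = peak_bandwidth_gbs(64, 1600)    # one 64-bit channel
dual_1600 = peak_bandwidth_gbs(128, 1600)     # two channels, 128-bit
single_1067 = peak_bandwidth_gbs(64, 1067)    # BayTrail-class memory speed

measured_cherrytrail = 11.0  # internal bandwidth (GB/s) from the table above

print(f"DDR3-1067, 64-bit:  {single_1067:.1f} GB/s peak")
print(f"DDR3-1600, 64-bit:  {single_1600:.1f} GB/s peak")
print(f"DDR3-1600, 128-bit: {dual_1600:.1f} GB/s peak")
print(f"CherryTrail measured vs 64-bit peak: {measured_cherrytrail / single_1600:.0%}")
```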

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Shaders
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 39 127.6 [+3.3x] Starting with DirectX FP32, CherryTrail is over 3.3x faster than BayTrail – not as high as we saw before but a good start.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 38.3 121.8 [+3.2x] 172 [+41%] OpenGL does not change matters, CherryTrail is still just over 3x (three times) faster than BayTrail. Here, though, Broadwell is 41% faster still…
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 39.2 109.5 [+2.8x] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 38.11 121.8 [+3.2x] 170 [+39%] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 7.48 18 [+2.4x] Unlike OpenCL driver, DirectX driver does support FP64 – so all GPUs run native FP64 code not emulation. Here, CherryTrail is only 2.4x faster than BayTrail.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 9.17 26 [+2.83x] 46.45 [+78%] As above, the OpenGL driver also supports FP64 – so all GPUs run native FP64 code again. CherryTrail is 2.8x faster here, but Broadwell is 78% faster still.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.3 (emulated) 1.34 [+3%] (emulated) (emulated) Here we’re emulating (mantissa extending) FP128 using FP64 not FP32 but it’s hard: CherryTrail’s performance falls to just 3% faster over BayTrail, perhaps some optimisations are needed.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1 (emulated) 1.1 [+10%] (emulated) 3.4 [+3.1x] (emulated) OpenGL does not change the results – but here we see Broadwell being 3x faster than both CherryTrail and BayTrail. Perhaps such heavy shaders are too much for our Atom GPUs.

Unlike GPGPU, here we don’t see the same crushing improvement – but CherryTrail’s GPU is still about 3x (three times) faster than BayTrail’s – though Broadwell shows its power. Perhaps our shaders are a bit too complex for pixel processing and should rather stay in the GPGPU field…

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Memory Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 6.74 11.18 [+65%] 12.46 [+11%] DirectX bandwidth is not as “bad” as OpenCL on BayTrail (better driver?) so we start from a higher baseline: CherryTrail still manages 65% more bandwidth – with Broadwell only squeezing 11% more despite its dual-channel controller. It shows that the OpenCL GPGPU driver has come a long way to match DirectX.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.62 2.83 [+8%] 5.29 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again BayTrail does better so CherryTrail can only be 8% faster than it – with Broadwell finally 87% faster.
Video Memory Benchmark Download Bandwidth (GB/s) 1.14 2.1 [+83%] 1.23 [-42%] Here BayTrail “stumbles” so CherryTrail can be 83% faster with Broadwell surprisingly 42% slower. What it does show is that the CherryTrail drivers are better despite being much newer. It is a pity Intel does not provide this driver for BayTrail too…

Again, we see gains in CherryTrail, with bandwidth increasing by 8-83%, which is necessary to keep all those new EVs fed with data; Broadwell does better on internal and upload bandwidth, but then again it has a dual-channel memory controller.

Final Thoughts / Conclusions

Here we tested the brand-new Atom X7 Z8700 (CherryTrail) GPU with 16 EVs (EV8) – 4x (four times) more than the older Atom Z3700 (BayTrail) GPU with 4 EVs (EV7) – so we expected big gains – and they were delivered: GPGPU performance is nothing less than stellar, obliterating the old Atom GPU to dust – no doubt also helped by the newer driver (which sadly BayTrail won’t get). And all at the same TDP of about 2.4-5W! Impressive!

Even the very latest Core M (Broadwell) GPU (EV8) is sometimes left behind – at about 2x higher power, more EVs and higher cost – perhaps the new Atom is too good?

Architecturally nothing much has changed (beside the far more EVs) – but we also get better bandwidth and lower latencies – no doubt due to higher memory bus clock.

All in all, there’s no doubt – this new Atom is the one to get and will bring far better graphics and GPGPU performance at low cost – even overshadowing the great Core M series – no doubt the reason it is found in the latest Surface 3.

To see how the Atom CherryTrail CPU fares, please see CPU Atom Z8700 (CherryTrail) performance article!