AVX512 performance improvement for SKL-X in Sandra SP2

Intel Skylake-X Core i9

What is AVX512?

AVX512 (Advanced Vector eXtensions 512) is the 512-bit SIMD instruction set that follows on from the previous 256-bit AVX2/FMA3/AVX instruction sets. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators – albeit in a somewhat different form – it has finally made it to the CPU lines with Skylake-X (SKL-X/EX/EP) – for now HEDT (i9) and Server (Xeon) – and hopefully to mainstream parts at some point.

Note it is rumoured that the current Skylake (SKL)/Kabylake (KBL) cores could also support it, based on core changes (widening of ports to 512-bit, unit changes, etc.) – nevertheless no public way of enabling it has been found.

AVX512 consists of multiple extensions and not all CPUs (or GPGPUs) may implement them all – see the detection sketch after the list:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit. [supported by SKL-X, Phi]
  • AVX512DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers [supported by SKL-X]
  • AVX512CD – Conflict Detection – loop vectorisation through predication [not supported by SKL-X, but by Phi]
  • AVX512ER – Exponential & Reciprocal – transcendental operations [not supported by SKL-X, but by Phi]
  • more sets will be introduced in future versions
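Code cannot assume any of these subsets is present – it must check both the CPUID feature flags and that the operating system has enabled the wider ZMM register state. A minimal detection sketch (C with MSVC intrinsics; bit positions per Intel’s documentation):

```c
#include <intrin.h>   /* __cpuid, __cpuidex, _xgetbv (MSVC) */
#include <stdio.h>

/* Returns non-zero if AVX512F can safely be used: the CPU must report it
   and the OS must have enabled the opmask/ZMM state via XSAVE. */
static int avx512f_usable(void)
{
    int r[4];

    __cpuid(r, 1);                       /* leaf 1: ECX bit 27 = OSXSAVE */
    if (!(r[2] & (1 << 27)))
        return 0;

    /* XCR0: bits 1,2 (SSE/AVX) and 5,6,7 (opmask, ZMM hi-halves) = 0xE6 */
    if ((_xgetbv(0) & 0xE6) != 0xE6)
        return 0;

    __cpuidex(r, 7, 0);                  /* leaf 7: EBX bit 16 = AVX512F */
    return (r[1] & (1 << 16)) != 0;      /* bits 17/30/31 = DQ/BW/VL, bit 28 = CD */
}

int main(void)
{
    printf("AVX512F usable: %s\n", avx512f_usable() ? "yes" : "no");
    return 0;
}
```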

As with anything, simply doubling register width does not automagically increase performance by 2x, as dependencies, memory load/store latencies and even data characteristics limit performance gains – some of which may require future architectures or tools to realise their true potential.
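To illustrate the dependency point: a reduction whose every FMA consumes the previous result runs at FMA latency rather than throughput, so doubling the register width gains little; splitting the work across independent accumulators is what recovers the gain. A hypothetical dot-product kernel sketched with AVX2 intrinsics (the same idea carries over to 512-bit registers):

```c
#include <immintrin.h>
#include <stddef.h>

/* Single accumulator: every FMA depends on the previous result, so the
 * loop runs at FMA *latency*, regardless of vector width. */
float dot_one_chain(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    float tmp[8], sum = 0.0f;
    _mm256_storeu_ps(tmp, acc);
    for (int j = 0; j < 8; j++) sum += tmp[j];
    return sum;
}

/* Four independent accumulators: the chains interleave, so the loop can
 * approach FMA *throughput* instead. */
float dot_four_chains(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
    for (size_t i = 0; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),      _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    float tmp[8], sum = 0.0f;
    _mm256_storeu_ps(tmp, acc);
    for (int j = 0; j < 8; j++) sum += tmp[j];
    return sum;
}
```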

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks SKL-X AVX512 SKL-X AVX2/FMA3 Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  1460 [+23%]  1180 For integer workloads we manage only a 23% improvement – not quite the 100% we were hoping for, but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  519 [+19%]  435 With a 64-bit integer workload the improvement reduces to 19%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  7.72 [=]  7.62 No SIMD here
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1800 [+80%]  1000 In this floating-point test we finally see the power of AVX512 – it is 80% faster than AVX2/FMA3 – a huge improvement.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  1150 [+85%]  622 Switching to FP64 increases the improvement to 85% – a huge gain.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  36 [+50%]  24 In this heavy algorithm using FP64 to mantissa-extend FP128 we see only a 50% improvement – still nothing to ignore.
AVX512 cannot bring a 100% improvement but does manage up to 85% – no mean feat! While integer workloads gain only 20-25%, that is still decent. Heavy compute algorithms will greatly benefit from AVX512.
BenchCrypt Crypto SHA2-256 (GB/s) 26 [+78%]  14.6 With no data dependency – we get good scaling of almost 80% even with this integer workload.
BenchCrypt Crypto SHA1 (GB/s)  39.8 [+51%]  26.4 Here we see only a 51% improvement, likely due to lack of (more) memory bandwidth – with more of it the gain would likely scale higher.
BenchCrypt Crypto SHA2-512 (GB/s)  21.2 [+94%]  10.9 With 64-bit integer workload we see almost perfect scaling of 94%.
As we work on different buffers and have no dependencies, AVX512 brings up to 94% performance improvement – limited only by memory bandwidth, with even 4-channel DDR4 @ 3200Mt/s not enough for a 10C/20T CPU. AVX512 is absolutely worth it to drive the system to its limit.
BenchScience SGEMM (GFLOPS) float/FP32  558 [-7%]  605 Unfortunately the current compiler does not seem to help.
BenchScience DGEMM (GFLOPS) double/FP64  235 [+2%]  229 Changing to FP64 at least allows AVX512 to win by a meagre 2%.
BenchScience SFFT (GFLOPS) float/FP32  35.3 [=]  35.3 Again the compiler does not seem to help here.
BenchScience DFFT (GFLOPS) double/FP64  19.9 [-2%]  20.2 With FP64 nothing much happens.
BenchScience SNBODY (GFLOPS) float/FP32  585 [-1%]  591 No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64  175 [-1%]  178 With FP64 workload nothing much changes.
With complex SIMD code not written in assembler, the compiler still has some way to go and performance gains do not materialise. But at least performance is no worse.
CPU Image Processing Blur (3×3) Filter (MPix/s)  3830 [+60%]  2390 We start well here with AVX512 60% faster with float FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  1700 [+70%]  1000 Same algorithm but more shared data improves by 70%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  885 [+56%]  566 Again same algorithm but even more data shared now brings the improvement down to 56%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  1290 [+56%]  826 Using two buffers does not change much still 56% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  136 [+59%]  85 Different algorithm keeps the AVX512 advantage the same at about 60%.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  65.6 [+31.7%]  49.8 Using the new scatter/gather in AVX512 still brings 30% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  3920 [+3%]  3800 Here we have a 64-bit integer workload algorithm with many gathers with AVX512 likely memory latency bound thus almost no improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  770 [+2%]  755 Again, loads of gathers do not allow AVX512 to shine – but performance is still decent.
As with other SIMD tests, AVX512 brings between 60-70% performance increase, very impressive. However in algorithms that involve heavy memory access (scatter/gather) we are limited by memory latency and thus we see almost no delta but at least it is not slower.
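For reference, the gather pattern behind the last two results looks roughly like the hypothetical table-lookup kernel below (AVX2 shown; AVX512 widens it to 16 lanes). With random indices every lane is an independent memory access, so the loop runs at memory latency rather than compute speed:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical table-lookup kernel: out[i] = table[idx[i]], 8 lanes at a time.
 * With random indices each lane is an independent cache/TLB access, so the
 * gather is latency bound - wider vectors add little, matching the
 * XorShift/Perlin results above. */
void gather_lookup(float *out, const float *table, const int *idx, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)(idx + i));
        __m256  v    = _mm256_i32gather_ps(table, vidx, 4); /* scale = sizeof(float) */
        _mm256_storeu_ps(out + i, v);
    }
}
```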

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that even for a 1st-generation CPU with AVX512 support, SKL-X greatly benefits from the new instruction set – with anything between 50-95% performance improvement. However, compiler/tools support is raw (VC++ 2017 only added it in the recent 15.3 version) and performance is sketchy where hand-crafted assembler is not used. But these will get better, and future CPU generations (CFL-X, etc.) will likely improve performance further.

Also let’s remember that some SKUs have a 2x FMA (aka 512-bit) licence (and other instruction resources) – while most SKUs have only 1x FMA (aka 256-bit); the former SKUs likely benefit even more from AVX512, and it is something Intel may be more generous in enabling in future generations.

In algorithms heavily dependent on memory bandwidth or latency AVX512 cannot work miracles, but it will at least extract the maximum possible compute performance from the CPU. SKUs with lower core counts (8, 6, 4, etc.) are likely to gain even more from AVX512.

In addition let’s not forget the “Phi” accelerators – that also support AVX512 – thus porting code will allow great performance on many-core (MIC) architecture too.

NUMA performance improvement for ThreadRipper in Sandra SP2

What is NUMA?

Modern CPUs have had a built-in memory controller for many years now – starting with the K8/Opteron – in order to achieve higher bandwidth and lower latency. As a result, in SMP systems each CPU has its own memory controller and its own system memory that it can access at high speed – while to access other memory it must send requests to the other CPUs. NUMA (Non-Uniform Memory Access) is a way of describing such systems, allowing the operating system and applications to allocate memory on the node they are running on for best performance.

As ThreadRipper is really two (2) Ryzen dies connected internally through InfinityFabric – it is basically a 2-CPU SMP system and thus a 2-node NUMA system.

While it is possible to configure it in UMA (Uniform Memory Access mode) where all memory appears to be unified and interleaved between nodes, for best performance the NUMA mode is recommended when the operating system and applications support it.
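On Windows, what NUMA-aware allocation boils down to is asking for pages on a preferred node; a minimal sketch using the Win32 NUMA APIs (error handling trimmed):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);   /* reports 1 on a 2-node system */
    printf("NUMA nodes: %lu\n", highest + 1);

    /* Allocate a 64MB buffer with pages preferred on node 0 - threads
     * running on that node then access it at local-memory speed. */
    SIZE_T size = 64ull << 20;
    void *buf = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                   0 /* preferred node */);
    if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }

    /* ... benchmark / use the buffer ... */
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```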

While Sandra has always supported NUMA in the standard benchmarks, some of the newer benchmarks had not been updated with NUMA support – especially since multi-core systems have pretty much killed SMP systems on the desktop, with only expensive servers left to bring SMP / NUMA support.

Note that all the NUMA improvements here also apply to competitor NUMA systems (e.g. Intel), thus it is not just for ThreadRipper – with EPYC systems likely showing a far higher improvement too.

In this article we test NUMA performance; please see our other articles on:

Native Performance

We are testing native performance with NUMA support enabled (2 nodes) versus disabled (UMA, single node) to determine the gains NUMA-aware allocation brings.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks NUMA 2-nodes UMA single-node Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  965 [+2.8%]  938 The ‘lightest’ workload should benefit most from NUMA support, but we only manage 3% here.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  312 [+2.3%] 305 With a 64-bit integer workload the improvement drops to 2%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  10.9 [=]  10.9 Emulating Int128 means a far heavier compute workload, making any NUMA overhead insignificant.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s)  997 [+1.2%]  985 Again no measured improvement here.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  562 [+1%]  556 Again no measured improvement here.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  27 [=]  26.85 In this heavy algorithm using FP64 to mantissa extend FP128 we see no improvement.
Fractals are compute intensive with few memory accesses – mainly to store results – thus we see a maximum of 3% improvement with NUMA support with the rest insignificant. However, this is a simple 2-node system – bigger 4/8-node systems would likely show bigger gains.
BenchCrypt Crypto AES256 (GB/s) 27.1 [+139%] 11.3 Hardware-accelerated AES is memory bandwidth bound, thus NUMA support matters; even on this 2-node system we see an over-2x improvement of 139%!
BenchCrypt Crypto AES128 (GB/s) 27.4 [+142%] 11.3 Similar to above we see a massive 142% improvement by allocating memory on the right NUMA node.
BenchCrypt Crypto SHA2-256 (GB/s)  32.3 [+50%] 21.4 SHA is also hardware accelerated but operates on a single input buffer (with a small output hash value buffer), and here our improvement drops to 50% – still very significant.
BenchCrypt Crypto SHA1 (GB/s) 34.2  [+56%]  21.8 Similar to above we see an even larger 56% improvement for supporting NUMA.
BenchCrypt Crypto SHA2-512 (GB/s)  6.36 [=]  6.35 SHA2-512 is not hardware accelerated (vectorised AVX2 is used instead) but heavily compute bound, thus the improvement drops to nothing.
Finally, in streaming algorithms we see just how much NUMA support matters: even on this 2-node system we see an over-2x improvement of 140% when working with 2 buffers (in/out). When using a single buffer the improvement drops to 50%, but that is still very significant. TR needs NUMA support to shine.
BenchScience SGEMM (GFLOPS) float/FP32  395 [+114%]  184 As with crypto, GEMM benefits greatly from NUMA support, with an incredible 114% improvement by allocating the (3) buffers on the right NUMA nodes.
BenchScience DGEMM (GFLOPS) double/FP64  183 [+131%]  79 Changing to FP64 brings an even more incredible 131%.
BenchScience SFFT (GFLOPS) float/FP32  11.6 [+86%]  6.25 FFT also shows big gains from NUMA support, with an 86% improvement just by allocating the buffers (2+1 const) on the right nodes.
BenchScience DFFT (GFLOPS) double/FP64  10.6 [+112%]  5 With FP64 the improvement again increases, to 112%.
BenchScience SNBODY (GFLOPS) float/FP32  479 [=]  483 Strangely N-Body does not benefit much from NUMA support with no appreciable improvement.
BenchScience DNBODY (GFLOPS) double/FP64  189 [=]  191 With FP64 workload nothing much changes.
As with crypto, buffer-heavy algorithms (GEMM, FFT) greatly benefit from NUMA support, with performance doubling (86-131%) just by allocating buffers on the right NUMA nodes; in effect TR needs NUMA in order to perform better than a standard Ryzen!
CPU Image Processing Blur (3×3) Filter (MPix/s)  2090 [+71%]  1220 The least compute brings the highest benefit from NUMA support – here it is 71%.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  886 [=]  890 Same algorithm but more compute brings the improvement down to nothing.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  494 [=]  495 Again the same algorithm but with even more compute – again no benefit.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  720 [=]  719 Using two buffers does not seem to show any benefit either.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  116 [=]  117 A different algorithm, but with more compute there is no benefit either.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  40.3 [=]  40.7 Using the new scatter/gather in AVX2 does not help matters even with NUMA support.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1880 [+90%]  982 Here we have a 64-bit integer workload with many gathers that is not compute-heavy, bringing a 90% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  397 [=]  396 Heavy compute brings down the improvement to nothing.
As with other SIMD tests, low-compute algorithms see a 70-90% improvement from NUMA support; heavy-compute algorithms bring the improvement down to zero. It all depends on whether the overhead of accessing other nodes can be masked by compute; in effect TR seems to perform pretty well.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that ThreadRipper needs NUMA support in applications – just like any other SMP system today – to shine: we see an over-2x improvement in bandwidth-heavy algorithms. However, in compute-heavy algorithms TR is able to mask the overhead pretty well, with NUMA bringing almost no improvement. For software without NUMA support the UMA mode should be employed.

Let’s remember we are only testing a 2-node system here; a 4+ node system is likely to show higher improvements, and with EPYC systems starting at 1-socket 4-node we can potentially have common 4-socket 16-node systems that absolutely need NUMA for best performance. We look forward to testing such a system as soon as possible 😉

 

AMD Threadripper Review & Benchmarks – CPU 16-core Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provides an SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equalling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – in the server (“EPYC”) designs, while current HEDT systems only use 2 – but AMD may release versions with more dies later on.

AMD Epyc/Threadripper die

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Threadripper (1950X) with HEDT competition (Intel SKL-X) as well as normal desktop solutions (Ryzen, Skylake) which also serves to compare HEDT with the “normal” desktop solution.

CPU Specifications AMD Threadripper 1950X Intel i9-7900X (SKL-X) AMD Ryzen 1700X Intel i7-6700K (SKL) Comments
Cores (CU) / Threads (SP) 16C / 32T 10C / 20T 8C / 16T 4C / 8T As with Ryzen, TR has the most cores – though Intel has just announced new SKL-X parts with more cores.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 0.8-4.0-4.2GHz (8x-40x-42x) SKL has the highest base clock but all CPUs have similar Turbo clocks
Power (TDP) 180W 150W 95W 91W TR has higher TDP than SKL-X just like Ryzen so may need a beefier cooling system
L1D / L1I Caches 16x 32kB 8-way / 16x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way TR and Ryzen’s instruction caches are 2x the size of their data caches (and of SKL/X’s), but all caches are 8-way.
L2 Caches 16x 512kB 8-way (8MB total) 20x 1MB 16-way (20MB total) 8x 512kB 8-way (4MB total) 4x 256kB 8-way (1MB total) SKL-X has really pushed the boat out with a 1MB L2 cache that dwarfs all other CPUs.
L3 Caches 4x 8MB 16-way (32MB total) 13.75MB 11-way 2x 8MB 16-way (16MB total) 8MB 16-way TR actually has 2 sets of 2 L3 caches rather than a combined L3 cache like SKL/X.
NUMA Nodes 2 (16GB each) no (unified 32GB) no (unified 16GB) no (unified 16GB) Only TR has 2 NUMA nodes.

Thread Scheduling and Windows

Threadripper’s topology (4 cores in each CCX, with 2 CCX per node and 2 nodes) makes things even more complicated for operating system (Windows) schedulers. Effectively we have a 2-tier NUMA SMP system where CCXes are level 1 and nodes are level 2 – thus the scheduling of threads matters a lot.
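Applications can discover this 2-tier layout at runtime; a sketch that counts NUMA nodes and L3 caches using the Win32 topology API (each CCX exposes its own L3 cache, so the L3 count reveals the CCX count):

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Count NUMA nodes and L3 caches; on a 1950X this should report
 * 2 nodes and 4 L3 caches (one per CCX). */
int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationAll, NULL, &len);
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *info = malloc(len);
    if (!info || !GetLogicalProcessorInformationEx(RelationAll, info, &len))
        return 1;

    int nodes = 0, l3 = 0;
    for (char *p = (char *)info; p < (char *)info + len; ) {
        SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *e = (void *)p;
        if (e->Relationship == RelationNumaNode) nodes++;
        if (e->Relationship == RelationCache && e->Cache.Level == 3) l3++;
        p += e->Size;    /* entries are variable-sized */
    }
    printf("NUMA nodes: %d, L3 caches (CCXes): %d\n", nodes, l3);
    free(info);
    return 0;
}
```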

Also keep in mind this is a NUMA system (2 nodes) with each node having its own memory; while for compatibility AMD recommends (and the BIOS defaults to) “UMA” (Unified) mode, “interleaving across nodes” – for best performance the non-interleaving mode (or “interleaving across CCX”) should be used.

What all this means is that you likely need a reasonably new operating system – thus Windows 10 / Server 2016 – with a kernel that has been updated to support Ryzen/TR, as Microsoft is not likely to care about old versions.
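NUMA-aware applications can also help the scheduler along by pinning worker threads to a node themselves; a minimal Win32 sketch (processor-group APIs, Windows 7 and later):

```c
#include <windows.h>

/* Run the calling thread only on the processors of the given NUMA node,
 * so its memory accesses (first-touch or VirtualAllocExNuma) stay local. */
BOOL pin_current_thread_to_node(USHORT node)
{
    GROUP_AFFINITY affinity;
    if (!GetNumaNodeProcessorMaskEx(node, &affinity))
        return FALSE;
    return SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL);
}
```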

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen/TR supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (otherwise supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP, likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks AMD Threadripper 1950X Intel i9-7900X (SKL-X) AMD Ryzen 1700X Intel i7-6700K (SKL) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 447 [-2%] 454 226 186 TR can keep up with SKL-X and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+1%] 456 236 184 An Int64 load does not change results.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 352 [+30%] 269 184 107 Finally TR soundly beats SKL-X by 30% and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 295 [+32%] 223 154 89 With an FP64 workload the lead increases slightly.
Unlike Ryzen, which soundly dominated Skylake (albeit with 2x more cores, 8 vs. 4), Threadripper does not have the same core advantage (16 vs. 10), thus it can only beat SKL-X in floating-point workloads where it is 30% faster – still a good result.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 918 [-22%] 1180 535 527 With AVX2/FMA SKL-X is just too strong, with TR 22% slower.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 307 [-29%] 435 161 191 With Int64 AVX2, TR is almost 30% slower than SKL-X.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 7 [+30%] 5.4 3.6 2 This is a tough test using Long integers to emulate Int128 without SIMD, and here TR manages to be 30% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 996 [=] 1000 518 466 In this floating-point AVX2/FMA vectorised test  TR manages to tie with SKL-X.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 559 [-10%] 622 299 273 Switching to FP64 SIMD code, TR is now 10% slower than SKL-X.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 27 [+12%] 24 13.7 10.7 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – TR manages a 12% win.
In vectorised AVX2/FMA code we see TR lose in most tests, or tie in one – and only shine in emulation tests not using SIMD instruction sets. Intel’s SIMD units – even without AVX512 that SKL-X brings – are just too strong for TR just as we saw Ryzen struggle against normal Skylake.
BenchCrypt Crypto AES-256 (GB/s) 27.1 [-21%] 34.4  14  15 All CPUs support AES HWA – but TR/Ryzen memory runs at just 2400Mt/s vs the 3200 that SKL-X enjoys (+33%), thus this is a good result; TR seems to use its channels pretty effectively.
BenchCrypt Crypto AES-128 (GB/s)  27.4 [-18%]  33.5  14  15 Similar to what we saw above TR is just 18% slower which is a good result; unfortunately we cannot get the memory over 2400Mt/s.
BenchCrypt Crypto SHA2-256 (GB/s)  32.2 [+2.2x]  14.6  17.1  5.9 Like Ryzen, TR’s secret weapon is SHA HWA, which allows it to soundly beat SKL-X – over 2.2x faster!
BenchCrypt Crypto SHA1 (GB/s) 34.2 [+30%] 26.4  17.7  11.3 Even against SKL-X’s multi-buffer AVX2 implementation, SHA HWA allows TR to stay 30% faster.
BenchCrypt Crypto SHA2-512 (GB/s)  6.34 [-41%]  10.9  3.35  4.38 SHA2-512 is not accelerated by SHA HWA (version 1), thus TR has to use the same vectorised AVX2 code and is 41% slower.
TR’s secret crypto weapon (as with Ryzen) is SHA HWA, which allows it to soundly beat SKL-X even with 33% less memory bandwidth; provided software is NUMA-enabled it seems TR can use its 4-channel memory controllers effectively.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 436 [+35%] 322  234.6  129 In this non-vectorised test TR beats SKL-X by 35%. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s)  366 [+32%]  277  198.6  109 Switching to FP64 code, TR still beats SKL-X by over 30%. So far so great.
BenchFinance Binomial float/FP32 (kOPT/s)  165 [+2.46x]  67.3  85.6  27.25 Binomial uses thread-shared data thus stresses the cache & memory system; we would expect TR to falter – but nothing of the sort – it is actually almost 2.5x faster than SKL-X, leaving it in the dust!
BenchFinance Binomial double/FP64 (kOPT/s)  83.7 [+27%]  65.6  45.6  25.54 With FP64 code the situation changes somewhat – TR is only 27% faster, but that is still an appreciable lead. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s)  91.6 [+42%]  64.3  49.1  25.92 Monte-Carlo also uses thread-shared data, but read-only, reducing modify pressure on the caches; TR reigns supreme, being over 40% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s)  68.7 [+34%]  51.2  37.1  19 Switching to FP64, TR is just 34% faster – but still a good lead.
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads TR soundly beats SKL-X by a big margin that even a 16-core version may not be able to match. But should these tests be vectorisable using SIMD – especially AVX512 – then we would fully expect Intel to win. But for now – for financial workloads there is only one choice: TR/Ryzen!!!
BenchScience SGEMM (GFLOPS) float/FP32  165 [?] 623  240.7  268 We need to implement NUMA fixes here to allow TR to scale.
BenchScience DGEMM (GFLOPS) double/FP64  75.9 [?]  216  102.2  92.2 We need to implement NUMA fixes here to allow TR to scale.
BenchScience SFFT (GFLOPS) float/FP32  16.6 [-51%]  34.3  8.57  19 FFT is also heavily vectorised but stresses the memory sub-system more; here TR cannot beat SKL-X and is 50% slower – but scales well against Ryzen.
BenchScience DFFT (GFLOPS) double/FP64  8 [-65%]  23.18  7.6  11.13 With FP64 code, the gap only widens with TR over 65% slower than SKL-X and little scaling over Ryzen.
BenchScience SNBODY (GFLOPS) float/FP32  456 [-22%]  587  234  272 N-Body simulation is vectorised but has many memory accesses to shared data – and here TR is only 22% slower than SKL-X but again scales well vs Ryzen.
BenchScience DNBODY (GFLOPS) double/FP64  173 [-2%]  178  87.2  79.6 With FP64 code TR almost catches up with SKL-X
With highly vectorised SIMD code TR cannot do as well – but an additional issue is that NUMA support needs to be improved: S/DGEMM shows how much of a problem this can be when all memory traffic uses a single NUMA node.
CPU Image Processing Blur (3×3) Filter (MPix/s)  1470 [-6%] 1560  775  634 In this vectorised integer AVX2 workload TR does surprisingly well against SKL-X just 6% slower.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  617 [-10%]  693  327  280 Same algorithm but with more shared data sees TR now 10% slower; more NUMA optimisations are needed.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  361 [-6%]  384  192  154 Again same algorithm but even more data shared now TR is 6% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  570 [-6%]  609  307  271 Different algorithm but still AVX2 vectorised workload – TR is still 6% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  106 [+35%]  78.3  57.3  34.9 Still AVX2 vectorised code but TR does far better, it is no less than 35% faster than SKL-X!
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  37.8 [-17%]  45.8  20  18.1 TR does worst here, it is 17% slower than SKL-X but still scales well vs. Ryzen.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1260 [?]  4260  1160  2280 This 64-bit SIMD integer workload is a problem for TR – likely a NUMA issue again, as there is not much scaling vs. Ryzen.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 420 [-45%]  777  175  359 TR really does not do well here but does scale well vs. Ryzen; likely some code optimisation is needed.

As TR (like Ryzen) supports most modern instruction sets (AVX2, FMA, AES/SHA HWA) it does well, but generally not enough to beat SKL-X; unfortunately the latter with AVX512 can potentially get even faster (up to 100%), increasing the gap even more.

While we’ve not tested memory performance in this article, we see in the streaming tests (e.g. AES, SHA) that even more memory bandwidth is needed to feed all 16 cores (32 threads) – being able to run the memory at higher speeds would be appreciated.

NUMA support is crucial – as non-NUMA algorithms take a big hit (see GEMM) where performance can be even lower than Ryzen. While complex server or scientific software won’t have this problem, most programs will not be NUMA aware.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks AMD Threadripper 1950X Intel i9-7900X (SKL-X) AMD Ryzen 1700X Intel i7-6700K (SKL) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS)  111 [+88%]  59  61.5  29 .Net CLR integer performance starts off very well with TR 88% faster than SKL-X an incredible result! This is *not* a fluke as Ryzen scores incredibly too.
BenchDotNetAA .Net Dhrystone Long (GIPS) 62.9 [+3%]  61  41  29 TR cannot match the same gain with 64-bit integer, but still just about manages to beat SKL-X.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS)  193 [+82%]  106  103  50 Floating-Point CLR performance is pretty spectacular with TR (like Ryzen) dominating – it is no less than 82% faster than SKL-X!
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS)  225 [+67%]  134  111  63 FP64 performance is also great with TR 67% faster than SKL-X an absolutely huge win!
It’s pretty incredible: for .Net applications TR – like Ryzen – is king! It is between 60-80% faster in all tests (except 64-bit integer). With more and more applications (apps?) running under the CLR, TR (like Ryzen) has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s)  195 [+38%]  141  92.6  53.4 In this non-vectorised test, TR is almost 40% faster than SKL-X – not as high as what we’ve seen before, but still significant.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s)  192 [+34%]  143  97.6  56.5 With a 64-bit integer workload we see no changes this time.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s)  626 [+27%]  491  347  241 Here we make use of RyuJit’s support for SIMD vectors, thus running AVX/FMA code; Intel strikes back through its SIMD units, but TR is still a comfortable 27% faster.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s)  344 [+14%]  301  192  135 Switching to FP64 SIMD vector code – still running AVX/FMA – TR’s lead falls to 14%, but it is still a win!
Taking advantage of RyuJit’s support for vectors/SIMD (through SSE2, AVX/FMA) allows SKL-X to gain some traction – but TR remains much faster, up to 40%. Whatever the workload, it seems TR just loves it.
Java Arithmetic Java Dhrystone Integer (GIPS)  1000 [+16%]  857 JVM integer performance is only 16% faster on TR than SKL-X – but a win is a win.
Java Arithmetic Java Dhrystone Long (GIPS)  974 [+26%]  771 With 64-bit integer workloads, TR is now 26% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS)  231 [+48%]  156 With a floating-point workload TR increases its lead to a massive 48%, a pretty incredible result.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS)  183 [+14%]  160 With FP64 workload the gap reduces way down to 14% but it is still faster than SKL-X.
Java performance is not as incredible as we’ve seen with .Net, but TR is still 15-50% faster than SKL-X – no mean feat! Again if you have Java workloads, then TR should be the CPU of choice.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s)  200 [+45%]  137 The JVM does not support SIMD/vectors, thus TR uses its scalar prowess to be 45% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s)  186 [+33%]  139 With a 64-bit vectorised workload TR is still 33% faster.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s)  169 [+69%]  100 With floating-point, TR is a massive 69% faster than SKL-X a pretty incredible result.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s)  159 [+59%]  100 With FP64 workload TR’s lead falls just a little to 59% – a huge win over SKL-X.
Java’s lack of vectorised primitives – which would allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) – gives TR (like Ryzen) free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

TR (like Ryzen) absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than the latest Intel SKL-X – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on TR. For .Net and Java code, TR is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It may be difficult to decide whether AMD’s design (multiple CCX units, multiple dies/nodes on a socket) is “cool”, and supporting it effectively is not easy for programmers – be they OS/kernel or application – but when it works, it works extremely well! There is no doubt that Threadripper can beat Skylake-X at the same cost (approx $1,000), albeit using more cores – just as its little (single-die) brother Ryzen does.

Scalar workloads and .Net/Java workloads just fly on it – but highly vectorised AVX2/FMA workloads only perform competitively; unfortunately once AVX512 support is added, SKL-X is likely to effectively dominate these workloads – though for now it’s early days.

Its multiple-NUMA-node design – unless running in UMA (unified) mode – requires both OS and application support, otherwise performance can tank to Ryzen levels; while server and scientific programs are likely to have it, this is a problem for most applications. Then we have the dual-CCX design which further complicates matters, effectively being a 2nd NUMA level; we can see inter-core latencies spanning 4 tiers while SKL-X only has 2 tiers.

In effect both platforms will get better in the future: Intel’s SKL-X with AVX512 support and AMD’s Threadripper with NUMA/CCX memory optimisations (and hopefully AVX512 support at one point). Intel are also already launching newer versions with more cores (up to 18C/36T) while AMD can release some server EPYC versions with 4 dies (and thus up to 32C/64T) that will both push power envelopes to the maximum.

For now, Threadripper is a return to form from AMD.

AMD Threadripper Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provides an SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equalling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – in the server (“EPYC”) designs, while current HEDT systems only use 2 – but AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels, greatly increasing bandwidth over Ryzen – just as with Intel.

AMD Threadripper die

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Threadripper (1950X) with its HEDT competition (Intel SKL-X) as well as normal desktop solutions (Ryzen, Skylake) – which also serves to compare HEDT with the “normal” desktop platforms.

CPU Specifications AMD Threadripper 1950X Intel i9-7900X (SKL-X) AMD Ryzen 1700X Intel i7-6700K (SKL) Comments
TLB 4kB pages 64 full-way, 1536 8-way  64 8-way, 1536 6-way  64 full-way, 1536 8-way  64 8-way, 1536 6-way  TR/Ryzen has comparatively “better” TLBs: 8-way vs 6-way and full-way vs 8-way.
TLB 2MB pages 64 full-way, 1536 2-way  8 full-way, 1536 6-way  64 full-way, 1536 2-way  8 full-way, 1536 6-way  Nothing much changes for 2MB pages, with TR/Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-3300 600-1200 800-4000 TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 2400 / 2666  2533 / 2400  2400 / 2666  2533 / 2400 TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X officially supports only up to 2400 but happily runs at 3200MHz – a big advantage.
Memory Channels / Width 4 / 256-bit  4 / 256-bit  2 / 128-bit  2 / 128-bit Both TR and SKL-X enjoy 256-bit wide (4-channel) memory.
Memory Timing (clocks) 14-16-16-32 7-54-18-9 2T  16-18-18-36 5-54-21-10 2T  14-16-16-32 7-54-18-9 2T  16-18-18-36 5-54-21-10 2T Despite faster memory, TR/Ryzen can run lower timings than SKL-X and SKL, reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth, which we will test presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively being different CPUs; thus sharing data between cores on different nodes is equivalent to doing so between different CPUs in an SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark’ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency) and worst match (highest latency) – thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
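The idea behind such a latency measurement is a ping-pong: the producer flips a shared flag, the consumer flips it back, and the round-trip time gives twice the one-way latency for wherever the two threads were scheduled. A portable sketch (C11 atomics and POSIX threads – not Sandra’s actual implementation):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int flag = 0;   /* the cache line the two threads bounce */

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Each round is a full round-trip; pinning the two threads (same core,
     * same CCX, different node) changes this number dramatically. */
    printf("one-way latency: %.1f ns\n", ns / (2.0 * ROUNDS));
    return 0;
}
```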

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks AMD Threadripper 1950X Intel i9-7900X (SKL-X) AMD Ryzen 1700X Intel i7-6700K (SKL) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s)  92.2 [+7%]  85.5  47.2  39.5 With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7% – assuming threads are scheduled correctly.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 7.51 [1/3]  24.4  5.75  16 In the worst case, pairs on TR land not just on different CCXes but on different NUMA nodes – thus bandwidth is 1/3 that of SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns)  15.4 [-1%]  15.8  15.5  16.1 Within the same core (sharing L1D/L2), TR/Ryzen inter-unit latency is ~15ns – comparable with both Intel CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns)  46.4 [-36%]  72.3  44.3  45 Within the same compute unit (sharing L3), the latency of ~45ns is much lower than SKL-X’s.
CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns)  184.7 [+4x]  135 Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Node (ns)  274.4 [+6x] Going inter-node increases the latency yet again, to 6x that of inter-core – thus scheduling is everything.
The multiple-CCX design does present some challenges to programmers, and threads will have to be carefully scheduled – as inter-CCX latencies are much larger than inter-core; going off-node increases latencies yet again but not by a lot – if anything the inter-node interconnect seems pretty low latency comparatively.
Aggregated L1D Bandwidth (GB/s)  1372 [-40%] 2252  739  878 SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete but they would do better against older designs.
Aggregated L2 Bandwidth (GB/s)  990 [-2%]  1010  565  402 The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X.
Aggregated L3 Bandwidth (GB/s)  749 [+2.6x]  289  300  247 The 4x L3 caches have much higher aggregate bandwidth than the single SKL-X cache.
Aggregated Memory (GB/s)  56 [-18%]  69  28  31 Running at a lower memory speed TR cannot beat SKL-X, but it has comparatively higher memory efficiency.
Even with 16x L1D and L2 caches, TR cannot match the much faster 10x SKL-X caches – which have been updated for 512-bit support – but it is competitive; the 4x L3 caches do soundly beat the unified one on SKL-X, but then again sharing data outside the same CCX is going to be very much slower.

At 2400Mt/s TR is running 33% slower than SKL-X at 3200Mt/s but its bandwidth is just 18% lower – thus its 4x DDR4 controllers are more efficient – not something we’re used to seeing.

Data In-Page Random Latency (ns)  72.8 [4-17-37] [+2.75x]  26.4 [4-13-33]  70.7 [4-17-37]  20 [4-12-21] What we saw previously with Ryzen was no accident: TR also suffers from surprisingly large in-page latencies – almost 3x those of the Intel designs. Either the TLBs are very slow or not working.
Data Full Random Latency (ns)  111.5 [4-17-44] [+47%]  75.5 [4-13-70]  87.9 [4-17-37]  65 [4-12-34] Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X.
Data Sequential Latency (ns)  5.5 [4-7-8] [=]  5.4 [4-11-13]  3.8 [4-7-8]  4.1 [4-12-13] TR’s prefetchers are working well, with sequential access pattern latency at ~5ns matching SKL-X.
We finally discover an issue – TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge: almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential accesses are competitive. It does point to the TLBs – whether they work at all or are just very much slower for some reason.
Code In-Page Random Latency (ns)  17.2 [4-10-26] [+43%] 12 [4-14-28]  16.1 [4-9-25]  10 [4-11-21] With code we don’t see the same problem – with in-page latency a bit higher than SKL-X (40%) but nowhere as high as what we saw before.
Code Full Random Latency (ns)  178 [4-15-60] [+2x]  86.1 [4-16-106]  95.4 [4-13-49]  70 [4-11-47] Out-of-page latency is a bit higher than SKL-X but not as bad as before.
Code Sequential Latency (ns)  8.7 [4-10-20] [+33%]  6.5 [4-7-12]  8.4 [4-9-18]  5.3 [4-9-20] TR’s prefetchers are working well with sequential access pattern latency at ~9ns – thus 33% higher than SKL-X.
While code access latencies are higher than the new SKL-X’s, they are comparable with the older designs and not as bad as what we saw with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies – though bandwidth seems just great.
Memory Update Transactional (MTPS)  1.9 52.2 [HLE]  4.18  32.4 [HLE] SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it.
Memory Update Record Only (MTPS)  1.88  57.23 [HLE]  4.22  25.4 [HLE] We see a similar pattern here.
Without HLE/RTM, TR (like Ryzen) doesn’t have much chance against SKL/X – but considering support for it is disabled in most SKUs, there’s not much AMD has to worry about – not to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs, Intel will have a problem on its hands…
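For reference, the lock-elision pattern such tests exercise looks roughly like the sketch below on Intel CPUs: attempt the critical section as a hardware transaction, subscribe to the fallback lock inside it, and take the lock for real after repeated aborts (a minimal sketch using the RTM intrinsics; compile with -mrtm):

```c
#include <immintrin.h>    /* _xbegin/_xend/_xabort, compile with -mrtm */
#include <stdatomic.h>

static atomic_int lock = 0;          /* fallback spinlock */
static long counter[64];

static void lock_acquire(void) {
    int expected;
    do { expected = 0; }
    while (!atomic_compare_exchange_weak(&lock, &expected, 1));
}
static void lock_release(void) { atomic_store(&lock, 0); }

/* Elided update: run the critical section as an RTM transaction, reading the
 * fallback lock inside it so a real lock holder aborts us; after a few
 * aborts, take the lock for real. */
void update_record(int i)
{
    for (int tries = 0; tries < 3; tries++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&lock) != 0)   /* subscribe to the lock */
                _xabort(0xff);
            counter[i]++;                  /* transactional region */
            _xend();
            return;
        }
        /* status holds the abort cause; simply retry */
    }
    lock_acquire();
    counter[i]++;
    lock_release();
}
```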

Threadripper’s core, memory and cache bandwidths are great – in many cases much higher than its Intel rivals’, partly due to more cores and more caches (16 vs 10); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies, which are far higher – about 3 times. TLB issues? We’ve been here before with Bulldozer, which could not easily be fixed – but if AMD does manage it this time, TR’s (and Ryzen’s) performance will literally fly!
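For context, ‘in-page random’ latencies are typically measured by pointer-chasing: each load’s address depends on the previous load, so nothing can overlap and time per step is the raw latency. A minimal sketch (not Sandra’s actual pattern; an ‘in-page’ variant would additionally constrain successive indices to the same 4kB page):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* XorShift64: rand() may only yield 15 bits - not enough to index 16M slots. */
static unsigned long long rng = 0x9E3779B97F4A7C15ull;
static size_t xrand(size_t bound)
{
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return (size_t)(rng % bound);
}

int main(void)
{
    size_t n = (size_t)1 << 24;            /* 128MB of pointers on LP64 */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle visiting every slot. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xrand(i);               /* j < i */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Chase: each load depends on the previous one - no overlap possible. */
    size_t p = 0, steps = 10000000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns/step (p=%zu)\n", ns / (double)steps, p); /* p defeats DCE */
    free(next);
    return 0;
}
```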

Still, despite this issue we’ve seen in the previous article that TR’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

TR’s memory performance is not the clean sweep we’ve seen in CPU testing, but it is competitive with Intel’s designs – especially the older ones. The bandwidths are all competitive and the memory controllers seem to be more efficient – but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs – unless you want an absolutely tried-and-tested design which has already been patched by microcode and firmware/BIOS updates.

Intel Core i9 (SKL-X) Review & Benchmarks – 4-channel @ 3200Mt/s Cache & Memory Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of the desktop/mobile Skylake CPU – the 6th-gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU, but what it does contain is more cores, more PCIe lanes and more memory channels (up to 6 64-bit) for huge memory bandwidth.

While it may seem an “old core”, the 7th-gen Kabylake core is not much more than a stepping update, with even the future 8th-gen Coffeelake rumoured to use the very same core again. But what it does include is the much-expected 512-bit AVX512 instruction set (ISA), which is not enabled in the current desktop/mobile parts.

SKL-X supports not only DDR4 but also NVM-DIMMs (non-volatile memory DIMMs) and PMem (Persistent Memory), which should revolutionise future computing with no need for memory refresh and immediate sleep/resume (no need to save/restore memory to/from storage).

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
TLB 4kB pages 64 4-way / 64 8-way, 1536 8-way  64 full-way, 1536 8-way  64 4-way / 64 8-way, 1536 6-way  64 4-way, 1024 8-way  Ryzen has comparatively ‘better’ TLBs than all Intel CPUs.
TLB 2MB pages 8 full-way, 1536 2-way  64 full-way, 1536 2-way  8 full-way, 1536 6-way  8 full-way, 1024 8-way  Again Ryzen has ‘better’ TLBs than all Intel versions.
Memory Controller Speed (MHz) 800-3300 600-1200 800-4000 1200-4000 Intel’s UNC clock runs higher than Ryzen
Memory Speed (MHz) Max 3200 / 2667  2400 / 2667  2533 / 2667  2133 / 2133 SKL-X can officially go as high as Ryzen and normal SKL @ 2667 – but runs happily at 3200Mt/s.
Memory Channels / Width 4 / 256-bit (max 6 / 384-bit)  2 / 128-bit  2 / 128-bit  4 / 256-bit SKL-X has 2 memory controllers, each with up to 3 channels, for massive memory bandwidth.
Memory Timing (clocks) 16-18-18-36 6-54-19-4 2T  14-16-16-32 7-54-18-9 2T  16-18-18-36 5-54-21-10 2T  14-15-15-36 4-51-16-3 2T SKL-X can run as tight timings as normal SKL or Ryzen.

Core Topology and Testing

Intel has dropped the (dual) ring bus(es) and instead opted for a mesh interconnect between cores; on desktop parts this should not cause latency differences between cores (as with Ryzen), but on high-end server parts with many cores (up to 28) this may not be the case. The much-increased L2 cache (1MB vs. the old 256kB) should alleviate this issue – though the L3 cache seems to have been reduced quite a bit.

Native Performance

We are testing bandwidth and latency performance using all the available SIMD instruction sets (AVX, AVX2/FMA, AVX512) supported by the CPUs.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 87 [+85%] 47.7 39 46 With 10 cores SKL-X has massive aggregated inter-core bandwidth, almost 2x Ryzen or HSW-E.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 19 [+46%] 13 16 17 In worst-case pairs  SKL-X does well but not far away from normal SKL or HSW-E.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.2 15.7 16 13.4 [-12%] Within the same core all modern CPUs seem to have about 15-16ns latency.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 80 45 [-43%] 49 58 Surprisingly we see a massive latency increase – almost 2x that of Ryzen or SKL.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 Naturally Ryzen scores worst when going off-CCX.
It seems the mesh inter-connect between cores has decent bandwidth but much higher latency than the older HSW-E or even the current SKL.
Aggregated L1D Bandwidth (GB/s) 2200 [+3x] 727 878 1150 SKL-X has 512-bit data ports thus massive L1D bandwidth over 2x HSW-E and 3x over Ryzen.
Aggregated L2 Bandwidth (GB/s) 1010 [+81%] 557 402 500 The large L2 caches also have 2x more bandwidth than either HSW-E or Ryzen.
Aggregated L3 Bandwidth (GB/s) 289 392 [+35%] 247 205 The 2 Ryzen L3 caches have higher bandwidth than all Intel CPUs.
Aggregated Memory (GB/s) 69.3 [+2.4x] 28.5 31 42.5 With its 4 channels SKL-X reigns supreme with almost 2.5x more bandwidth than Ryzen.
The widened ports on the L1 and L2 caches allow SKL-X to demolish the competition with over 2x more bandwidth than either Ryzen or older HSW-E; only the smaller L3 cache falters. Its 4 channels running at 3200Mt/s yield huge memory bandwidth that greatly help streaming algorithms. SKL-X is a monster – we can only speculate what the server 6-channel version would score.
Data In-Page Random Latency (ns) 26 [1/2.84x] (4-13-33) 74 (4-17-36) 20 (4-12-21) 25 (4-12-26) SKL-X has latency comparable with SKL and HSW-E, and much better than Ryzen.
Data Full Random Latency (ns) 75 [-21%] (4-13-70) 95 (4-17-37) 65 (4-12-34) 72 (4-13-52) Full random latencies are a bit higher than expected, but on par with HSW-E and better than Ryzen.
Data Sequential Latency (ns) 5.4 [+28%] (4-11-13) 4.2 (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Strangely, SKL-X does not do as well as SKL or Ryzen here, but at least it beats HSW-E.
If you were hoping for SKL-X to match normal SKL, that is sadly not the case: even at similar Turbo clocks, latencies are higher across the board – even allowing Ryzen a win. Perhaps further platform optimisations are needed.
Code In-Page Random Latency (ns) 12 [-27%] (4-14-28) 16.6 (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code SKL-X performs better though not enough to catch normal SKL.
Code Full Random Latency (ns) 86 [-15%] (4-16-106) 102 (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page code latency takes a bigger hit but nothing to worry about.
Code Sequential Latency (ns) 6.5 [-27%] (4-7-12) 8.9 (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Again nothing much changes here.
SKL-X again does not manage to match normal SKL but soundly trounces both Ryzen and its older HSW-E brother, delivering a good result overall. Code access seems to perform more consistently than data for some reason we need to investigate.
Memory Update Transactional (MTPS) 52.2 [+12x] HLE 4.23 32.4 HLE 7 SKL-X with working HLE is over 12-times faster than Ryzen and older HSW-E.
Memory Update Record Only (MTPS) 57.2 [+13.6x] HLE 4.19 25.4 HLE 5.47 SKL-X is king of the hill with nothing getting close.
Yes – Intel has finally fixed HLE/RTM – owners of HSW-E and BRW-E must feel very hard done by, considering it was “working” there before being disabled due to the errata. Thus, after so many years, we have HLE, RTM and AVX512! Great!

If there was any doubt, SKL-X does not disappoint – massive cache (L1D and L2) aggregate and memory bandwidths with server versions likely even more; the smaller L3 cache does falter though which is a bit of a surprise – the larger L2 caches must have forced some compromises to be made.

Latency is a bit disappointing compared to the “normal” SKL/KBL we have on desktop, but still better than the older HSW-E and the Ryzen competitor. The L1 and L2 caches (despite the latter being 4 times bigger) have OK latencies in clocks – with the L3 and memory controller being the source of the increased latencies.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

After a strong CPU performance we did not expect the cache and memory performance to disappoint – and it does not. SKL-X is a big improvement over the older versions (HSW-E) and competition with few weaknesses.

The mesh interconnect does seem to exhibit higher inter-core latencies with only a small increase in bandwidth; perhaps this can be fixed.

The very much reduced L3 cache does disappoint both bandwidth and latency wise; the memory controllers provide huge bandwidth but at the expense of higher latencies.

All in all, if you can afford it, there is no question that SKL-X is worth it. But better wait to see what AMD’s Threadripper has in store before making your choice… 😉

Intel Core i9 (SKL-X) Review & Benchmarks – CPU 10-core AVX512 Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of the desktop/mobile Skylake CPU – the 6th-gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU, but what it does contain is more cores, more PCIe lanes and more memory channels:

  • Server 2S, 4S and 8S (sockets)
  • Workstation 1S and 2S
  • Up to 28 cores and 56 threads per CPU
  • Up to 48 PCIe 3.0 lanes
  • Up to 46-bit physical address space and 48-bit virtual address space
  • 512-bit SIMD aka AVX512 (AVX512F, AVX512BW, AVX512DQ)

While it may seem an “old core”, the 7th-gen Kabylake core is not much more than a stepping update, with even the future 8th-gen Coffeelake rumoured to use the very same core again. But what it does include is the much-expected 512-bit AVX512 instruction set (ISA), which is not enabled in the current desktop/mobile parts.

On the desktop, Intel is now using the “i9” moniker for its top parts – in a way a much-needed change for its top HEDT platform (socket 2011, now socket 2066) to differentiate it from the mainstream one.

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
Cores (CU) / Threads (SP) 10C / 20T 8C / 16T 4C / 8T 6C / 12T SKL-X manages more cores than Ryzen (10 vs 8) which considering their speed may just be too tough to beat. HSW-E topped at 8 cores also.
Speed (Min / Max / Turbo) 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) 0.8-4.0-4.2GHz (8x-40x-42x) 1.2-3.3-4.0GHz (12x-33x-40x) SKL-X somehow manages a higher single-core turbo than even SKL (43x vs 42x) – but its rated speed is a match for Ryzen and HSW-E.
Power (TDP) 140W 95W 91W 140W Ryzen has a TDP comparable to SKL, while HSW-E and SKL-X are both almost 50% higher.
L1D / L1I Caches 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 6x 32kB 8-way / 6x 32kB 2-way Ryzen’s instruction cache is 2x its data cache – a somewhat strange decision; all caches are 8-way except HSW-E’s L1I (2-way).
L2 Caches 10x 1MB 16-way 8x 512kB 8-way 4x 256kB 8-way 6x 256kB 8-way Surprise surprise – the new SKL-X’ L2 is 4-times the size of SKL/HSW-E and thus even beating Ryzen. Large datasets should have no problem getting cached.
L3 Caches 13.75MB 11-way 2x 8MB 16-way 8MB 16-way 15MB 20-way In a somewhat surprising move, the L3 cache has been reduced pretty drastically and is now smaller than both Ryzen and even the very old HSW-E!

 

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks i9-7900X (Skylake-X) Ryzen 1700X i7-6700K 4C/8T (Skylake) i7-5820K (Haswell-E) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 446 [+54%] AVX2 290 AVX2 185 AVX2 233 AVX2 Dhrystone does not yet use AVX512 – but no matter SKL-X beats Ryzen by over 50%!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+57%] AVX2 292 AVX2 185 AVX2 230 AVX2 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 271 [+46%] AVX/FMA 185 AVX/FMA 109 AVX/FMA 150 AVX/FMA Whetstone does not yet use AVX512 either – but SKL-X is still approx 50% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 223 [+50%] AVX/FMA 155 AVX/FMA 89 AVX/FMA 116 AVX/FMA With FP64 the winning streak continues.
The Empire strikes back – SKL-X beats Ryzen by a sizeable difference (50%) across integer or floating-point workloads even on “legacy” AVX2/FMA instruction set. It will only get faster once AVX512 is enabled.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1460 [+2.7x] AVX512DQW 535 AVX2 513 AVX2 639 AVX2 For the 1st time we see AVX512 in action and everything is pummeled into dust – almost 3-times faster than Ryzen!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 521 [+3.3x] AVX512DQW 159 AVX2 191 AVX2 191 AVX2 With a 64-bit integer vectorised workload SKL-X is over 3-times faster than Ryzen!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 5.37 [+48%] 3.61 2.15 2.74 This is a tough test using Long integers to emulate Int128 without SIMD and thus SKL-X returns to “just” 50% faster than Ryzen.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1800 [+3.4x] AVX512F 530 FMA 479 FMA 601 FMA In this floating-point vectorised test we see again the power of AVX512 with SKL-X is again over 3-times faster than Ryzen!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 1140 [+3.8x] AVX512F 300 FMA 271 FMA 345 FMA Switching to FP64 SIMD code SKL-X gets even faster, approaching 4 times Ryzen’s score.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 24 [+84%] AVX512F 13.7 FMA 10.7 FMA 12 FMA In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – SKL-X returns to just 85% faster.
Ryzen’s SIMD units were never strong – splitting 256-bit ops into 2 – but with AVX512 SKL-X is unstoppable: integer or floating-point, we see it over 3 times faster – a serious improvement in performance. Even against its older HSW-E brother it is over 2 times faster – a significant upgrade. For heavy vectorised SIMD code – as long as it’s updated to AVX512 – there is no other choice.
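To put those gains in context: Sandra’s kernels are not public, but the improvement fundamentally comes from each instruction processing twice as many lanes. A minimal sketch of the same fused multiply-add in AVX2/FMA3 and AVX512F intrinsics (our own illustration – the function names are hypothetical, n is assumed a multiple of the vector width, and the AVX512 path requires a compiler flag such as /arch:AVX512 on Visual C++ 2017 or -mavx512f on GCC/Clang):

```cpp
#include <immintrin.h>
#include <cstddef>

// AVX2/FMA3: 8 FP32 lanes per fused multiply-add.
void fma_avx2(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(c + i, _mm256_fmadd_ps(va, vb, vc)); // c = a*b + c
    }
}

// AVX512F: 16 FP32 lanes per fused multiply-add - twice the width, which is
// where the theoretical 2x (and measured ~1.8x) gains come from.
void fma_avx512(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_loadu_ps(c + i);
        _mm512_storeu_ps(c + i, _mm512_fmadd_ps(va, vb, vc));
    }
}
```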
BenchCrypt Crypto AES-256 (GB/s) 32.7 [+2.4x] AES 13.8 AES 15 AES 20 AES All CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and with 4 memory channels SKL-X reigns supreme – it’s over 2 times faster.
BenchCrypt Crypto AES-128 (GB/s) 32 [+2.3x] AES 13.9 AES 15 AES 20.1 AES What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to even match HSW-E, never mind SKL-X.
BenchCrypt Crypto SHA2-256 (GB/s) 25 [+46%] AVX512DQW 17.1 SHA 5.9 AVX2 7.6 AVX2 Even Ryzen’s support for SHA hardware acceleration is not enough as memory bandwidth lets it down with SKL-X “only” 50% faster through AVX512.
BenchCrypt Crypto SHA1 (GB/s) 39.3 [+2.3x] AVX512DQW 17.3 SHA 11.3 AVX2 15.1 AVX2 SKL-X only gets faster with the simpler SHA1 and is now over 2-times faster.
BenchCrypt Crypto SHA2-512 (GB/s) 21.1 [+6.3x] AVX512DQW 3.34 AVX2 4.4 AVX2 5.34 AVX2 SHA2-512 is not accelerated by SHA HWA thus Ryzen is forced to use SIMD and loses badly.
Memory bandwidth rules here and SKL-X with its 4 channels of ~100GB/s bandwidth reigns supreme (we can only imagine what the 6-channel beast will score) – so Ryzen loses badly. Its ace card – support for SHA HWA – is not enough to “save it”, as AVX512 allows SKL-X to power through the algorithms like a knife through butter. The 64-bit SHA2-512 test is sobering, with SKL-X no less than 6 times faster than Ryzen.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 320 [+36%] 234 129 157 In this non-vectorised test SKL-X is only 36% faster than Ryzen. SIMD would greatly help it here.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 277 [+40%] 198 108 131 Switching to FP64 code nothing much changes, SKL-X is just 40% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 66.9 [-21%] 85.1 27.2 37.8 Binomial uses thread shared data thus stresses the cache & memory system; somehow Ryzen manages to win this.
BenchFinance Binomial double/FP64 (kOPT/s) 65 [+41%] 45.8 25.5 33.3 With FP64 code the situation gets back to “normal” – with SKL-X again 40% faster than Ryzen.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 64 [+30%] 49.2 25.9 31.6 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; SKL-X is just 30% faster here.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 51 [+36%] 37.3 19.1 21.2 Switching to FP64 where Ryzen did so well – SKL-X returns to 40% faster.
Without the help of its SIMD engine, SKL-X is still 30-40% faster than Ryzen but over 2 times faster than HSW-E, showing just how much the core has improved for complex code with lots of shared data (read-only or modifiable). While Ryzen thought it had found its “niche”, it has already been beaten…
BenchScience SGEMM (GFLOPS) float/FP32 343 [+5x] FMA 68.3 FMA 109 FMA 185 FMA GEMM has not yet been updated for AVX512 but SKL-X is an incredible 5x faster!
BenchScience DGEMM (GFLOPS) double/FP64 124 [+2x] FMA 62.7 FMA 72 FMA 87.7 FMA Even without AVX512, with FP64 vectorised code, SKL-X still manages 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 34 [+3.8x] FMA 8.9 FMA 18.9 FMA 18 FMA FFT has also not been updated to AVX512 but SKL-X is still 4x faster than Ryzen!
BenchScience DFFT (GFLOPS) double/FP64 19 [+2.5x] FMA 7.5 FMA 9.3 FMA 10.9 FMA With FP64 SIMD SKL-X is over 2.5x faster than Ryzen in this tough algorithm with loads of memory accesses.
BenchScience SNBODY (GFLOPS) float/FP32 585 [+2.5x] FMA 234 FMA 273 FMA 158 FMA NBODY is not yet updated to AVX512 but again SKL-X wins.
BenchScience DNBODY (GFLOPS) double/FP64 179 [+2x] FMA 87 FMA 79 FMA 40 FMA With FP64 code SKL-X is still 2-times faster than Ryzen.
With highly vectorised SIMD code, even without the help of AVX512, SKL-X is over 2.5x faster than Ryzen, but more than that – almost 4-times faster than its older HSW-E brother!
CPU Image Processing Blur (3×3) Filter (MPix/s) 1639 [+2.2x] AVX2 750 AVX2 655 AVX2 760 AVX2 In this vectorised integer AVX2 workload SKL-X is over 2x faster than Ryzen.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 711 [+2.2x] AVX2 316 AVX2 285 AVX2 345 AVX2 Same algorithm but more shared data does not change anything.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 377 [+2.2x] AVX2 172 AVX2 151 AVX2 188 AVX2 Again same algorithm but even more data shared does not change anything again.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 609 [+2.1x] AVX2 292 AVX2 271 AVX2 316 AVX2 Different algorithm but still SKL-X is still 2x faster than Ryzen.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 79.8 [+36%] AVX2 58.5 AVX2 35.4 AVX2 50.3 AVX2 Still AVX2 vectorised code but here Ryzen does much better, with SKL-X just 36% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 15.7 [+63%] 9.6 6.3 7.6 This test is not vectorised though it uses SIMD instructions and here SKL-X only manages to be 63% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1000 [+17%] 852 422 571 Again in a non-vectorised test Ryzen just flies but SKL-X manages to be 20% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 190 [+29%] 147 75 101 In this final non-vectorised test Ryzen really flies but not enough to beat SKL-X which is 30% faster.
As with other SIMD tests, SKL-X remains just over 2 times faster than Ryzen and holds a similar lead over HSW-E. But without SIMD its lead drops significantly to just 20-60%, showing just how well Ryzen performs.
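For reference, the Blur/Sharpen/Motion-Blur filters above are straightforward convolutions – exactly the kind of inner loop SIMD widens. A scalar sketch of the 3×3 case (our own illustration, not Sandra’s actual kernel; borders are skipped for brevity):

```cpp
// Scalar 3x3 box blur over a single-channel float image; dst/src hold
// width*height pixels. This inner accumulation is what AVX2/AVX512 widens
// to 8/16 pixels per instruction in a vectorised implementation.
void blur3x3(const float* src, float* dst, int width, int height) {
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += src[(y + dy) * width + (x + dx)];
            dst[y * width + x] = sum * (1.0f / 9.0f); // average of 9 taps
        }
    }
}
```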

When using the new AVX512 instruction set – we see incredible performance with SKL-X about 3x faster than its Ryzen competitor and about 2x faster than the older HSW-E; with the older AVX2/FMA instruction sets supported by all CPUs, it is “only” about 2x faster. When using non-vectorised SIMD code its lead shortens to about 30-60%.

While we’ve not tested memory performance in this article, we see that in streaming tests its 4 DDR4 channels trounce 2-channel CPUs that just cannot feed all their cores. Being able to use much faster DDR4 memory (3200 vs 2133) allows it to also soundly beat its older HSW-E brother.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks i9-7900X (Skylake-X) Ryzen 1700X i7-6700K 4C/8T (Skylake) i7-5820K (Haswell-E) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 69.8 [+1.9x] 36.5 23.3 30.7 While Ryzen used to dominate .Net CLR workloads, SKL-X is now almost 2x faster than it – and naturally than the older HSW-E.
BenchDotNetAA .Net Dhrystone Long (GIPS) 60.9 [+35%] 45.1 23.6 28.2 Ryzen seems to do very well here, cutting SKL-X’s lead to just 35% – while still being almost 2x faster than HSW-E.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 112 [+12%] 100.6 47.4 65.4 Floating-point CLR performance is pretty spectacular with Ryzen, and SKL-X manages only 12% faster.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 138 [+14%] 121.3 63.6 85.7 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with SKL-X just 14% faster.
While Ryzen used to dominate .Net workloads, SKL-X restores the balance in Intel’s favour – though in many tests it is just over 10% faster than Ryzen. The CLR definitely seems to prefer Ryzen.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 140 [+50%] 92.6 55.7 75.4 Just as we saw with Dhrystone, this integer workload sees a 50% improvement for SKL-X. While RyuJIT supports SIMD integer vectors, the lack of bitfield instructions makes it slower for our code; shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 143 [+47%] 97.8 60.3 79.2 With 64-bit integer workload we see a similar story – SKL-X is about 50% faster.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 543 [+2x] AVX/FMA 272.7 AVX/FMA 12.9 284.2 AVX/FMA Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code – SKL-X strikes back to 2x faster than Ryzen.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 294 [+2x] AVX/FMA 149 AVX/FMA 38.7 176.1 AVX/FMA Switching to FP64 SIMD vector code – still running AVX/FMA – SKL-X is still 2x faster.
With RyuJIT’s support for SIMD vector instructions – SKL-X brings its power to bear, being the usual 2 times faster than Ryzen; it does not seem that RyuJIT supports AVX512 yet – something that would make it even faster. With scalar instructions SKL-X is “only” 50% faster, but still about 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Integer (GIPS) 716 [+39%] 513 313 395 Ryzen puts in a strong performance with SKL-X “just” 40% faster. Still, it’s almost 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Long (GIPS) 873 [+70%] 514 332 399 Somehow SKL-X does better here, at 70% faster than Ryzen.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 155 [+32%] 117 62.8 89 With a floating-point workload Ryzen continues to do well so SKL-X is again “just” 30% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 160 [+25%] 128 64.6 91 With FP64 workload SKL-X’s lead drops to 25%.
With the JVM seemingly favouring Ryzen – and without SIMD – SKL-X is just 25-40% faster than it – but do note it absolutely trounces its older HSW-E brother – being almost 2x faster. So Intel has made big gains but at a cost.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 135 [+40%] 99 59.5 82 Oracle’s JVM does not yet support SIMD vectors so SKL-X is “just” 40% faster than Ryzen.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 132 [+41%] 93 60.6 79 With 64-bit integers nothing much changes.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 97 [+13%] 86 40.6 61 Scary times, as SKL-X manages its smallest lead over Ryzen at just over 10%.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 99 [+20%] 82 40.9 63 With an FP64 workload SKL-X is lucky to increase its lead to 20%.
Java’s lack of vectorised primitives – which would allow the JVM to use SIMD instruction sets (SSE2, AVX/FMA, AVX512) – lets the competition creep up on SKL-X at far lower cost; Intel had better hope Oracle adds vector primitives so Java code can use the power of its CPUs’ SIMD units. This is not a good place for Intel to be in.

While Ryzen used to dominate .Net and Java benchmarks – SKL-X restores the balance in Intel’s favour – though both the CLR and JVM do seem to “favour” Ryzen for some reason. If you are running the older HSW-E then you can be sure SKL-X is over 2x faster than it throughout.

Thus current and future applications running under the CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run much better on SKL-X than on older Intel designs, but also reasonably well on Ryzen – at least when not using SIMD vector extensions, where SKL-X’s power comes to the fore.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Just when AMD were likely celebrating their fantastic Ryzen, Intel strikes back with a killer – though really expensive – CPU. While we’ve not seen major core advances since SandyBridge (SNB and SNB-E) – and will likely not see anything new in Coffeelake (CFL) either – somehow these improvements add up to quite a lot, with SKL-X soundly beating both Ryzen and its older HSW-E brother.

We finally see AVX512 released and it does not disappoint: SKL-X increases its lead by 50% through it, but note that lower-end CPUs will execute some instructions a lot slower, which is unfortunate. Using AVX512 also requires new tools – either a compiler, which on Windows means the brand-new Visual C++ 2017, or assemblers – and a decent amount of work – thus not something most developers will do, at least until the normal desktop/mobile platforms support it too.
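Because the instructions fault on older CPUs, AVX512 code paths must also be selected at run-time. A minimal detection sketch using the MSVC-style CPUID/XGETBV intrinsics (an assumption of how one might do it, not Sandra’s actual code; GCC/Clang offer equivalents such as __get_cpuid_count):

```cpp
#include <intrin.h>      // __cpuidex, _xgetbv (MSVC)

// Returns true if both the CPU and the OS support AVX512F.
// Assumes CPUID leaf 7 exists (true on any AVX512-era CPU).
bool has_avx512f() {
    int regs[4];
    __cpuidex(regs, 7, 0);                      // extended feature leaf
    bool cpu_f = (regs[1] & (1 << 16)) != 0;    // EBX bit 16: AVX512F
    __cpuidex(regs, 1, 0);
    bool osxsave = (regs[2] & (1 << 27)) != 0;  // ECX bit 27: OSXSAVE
    if (!cpu_f || !osxsave) return false;
    // XCR0 must enable XMM/YMM state plus the three ZMM state components
    // (bits 1, 2, 5, 6, 7 -> mask 0xE6), i.e. the OS saves 512-bit state.
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0xE6) == 0xE6;
}
```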

All in all it is a solid upgrade – though costly – but if it’s performance you’re after you can “safely” remain with Intel – you don’t need to join the “rebel camp”. But we’ll need to see what AMD’s Threadripper has in store for us… 😉

FP16 GPGPU Image Processing Performance & Quality

GPGPU Image Processing

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the IEEE lower-precision floating-point representation that has recently begun to be supported by GPGPUs for compute (e.g. Intel EV9+ Skylake GPU, nVidia Pascal), while CPU support is still limited to SIMD conversion only (FP16C). It has been added to allow mobile devices (phones, tablets) to provide increased performance (and thus save power for fixed workloads) for a small drop in quality for normal 8-bpc (24-bpp) image and video.

However, normal laptops and tablets with integrated graphics can also benefit from FP16 support in the same way, due to their relatively low graphics compute power and the need to save power given the limited battery in thin-and-light formats.
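On the kernel side, using FP16 in OpenCL is largely a matter of enabling the cl_khr_fp16 extension and swapping float types for half. A minimal sketch (our own illustration – the buffer names are hypothetical):

```c
// OpenCL C kernel source: a sketch, assuming the device reports cl_khr_fp16.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void mad_half(__global const half4* a,
                       __global const half4* b,
                       __global half4* c) {
    size_t i = get_global_id(0);
    // Same arithmetic as the FP32 path but half the storage and bandwidth
    // per element - which is where the ~2-3x gains below come from.
    c[i] = a[i] * b[i] + c[i];
}
```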

In this article we’re investigating the performance differences vs. standard FP32 (aka “single”) and the resulting quality difference (if any) for mobile GPGPUs (Intel’s EV9/9.5 SKL/KBL). See the previous articles for general performance comparison:

Image Processing Performance & Quality

We are testing GPGPU performance of the GPUs in OpenCL and DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Image Filter FP32/Single FP16/Half Comments
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  481  967 [+2x] We see a text-book 2x performance increase for no visible drop in quality.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  107  331 [+3.1x] Using FP16 yields over 3x performance increase but we do see a few more changed pixels though no visible difference.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  112  325 [+2.9x] Again almost 3x performance increase but no visible quality difference. Result!
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  107  323 [+3.1x] Again just over 3x performance increase but no visible quality difference.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s) 5.41  5.67 [+4%] No image difference at all but also almost no performance increase – a measly 4%.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  4.7  13.48 [+2.86x] We’re back to an almost 2.9x performance increase, with a few more changed pixels than before, though quality seems acceptable.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  1188  1210 [+2%] Due to random number generation using 64-bit integer processing, the performance difference is minimal – but the picture quality is not acceptable.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s) 470  508 [+8%] Again, due to Perlin noise generation we see almost no performance gain but a big drop in image quality – not worth it.

Other Image Processing relating Algorithms

Image Filter FP16/Half FP32/Single FP64/Double Comments
GPGPU Science Benchmark GEMM OpenCL (GFLOPS)  178 [+50%]  118  35 Dropping to FP16 gives us 50% more performance, not as good as 2x but still a significant increase.
GPGPU Science Benchmark FFT OpenCL (GFLOPS)  34 [+70%]  20  5.4 With FFT we are now 70% faster, closer to the 100% promised.
GPGPU Science Benchmark N-Body OpenCL (GFLOPS)  297 [+49%]  199  35 Again we drop to “just” 50% faster with FP16 but still a great performance improvement.

Final Thoughts / Conclusions

For many image processing filters (Blur, Sharpen, Sobel/Edge-Detection, Median/De-Noise, etc.) we see a huge 2-3x performance increase – more than we’d hoped for (2x) – with little or no image quality degradation. Thus FP16 support is very much useful and should be used when supported.

However, for complex filters (Diffusion, Marble/Perlin Noise) the drop in quality is not acceptable for such a minor performance increase (2-8%); increasing the precision of more data items to improve quality (from FP16 to FP32) would further drop performance, making the whole endeavour pointless.

For those algorithms that do benefit from FP16 the performance improvement with FP16 is very much worth it – so FP16 support is very useful indeed.

Intel Graphics GPGPU Performance

Intel Logo

Why test GPGPU performance Intel Core Graphics?

Laptops (and tablets) are still in fashion, with desktops largely left to PC game enthusiasts and workstations for big compute workloads; most laptops (and all tablets) make do with integrated graphics, with the few dedicated graphics options aimed mainly at mobile PC gamers.

As a result, integrated graphics on Intel’s mobile platform is what the vast majority of users will experience – thus its importance is not to be underestimated. While in the past the integrated graphics options were dire, the introduction of the Core v3 (Ivy Bridge) series brought us a GPGPU-capable graphics processor as well as an update of the internal media transcoder introduced with Core v2 (Sandy Bridge).

With each generation Intel has progressively improved the graphics core, perhaps far more than its CPU cores – and added more variants (GT3) and embedded cache (eDRAM) which greatly increased performance – all within the same power limit.

New Features enabled by the latest 21.45 graphics driver

With Intel graphics drivers supporting just 2 generations of graphics – unlike the unified drivers of AMD and nVidia – old graphics quickly become obsolete with few updates; but the Windows 10 “free upgrade” forced Intel’s hand somewhat, with its driver (20.40) supporting 3 generations of graphics (Haswell, Broadwell and, latest at the time, Skylake).

However, the latest 21.45 driver for newly released Kabylake and older Skylake does bring new features that can make a big difference in performance:

  • Native FP64 (64-bit aka “double” floating-point support) in OpenCL – thus allowing high precision compute on integrated graphics.
  • Native FP16 (16-bit aka “half” floating-point support) in OpenCL, ComputeShader – thus allowing lower precision but faster compute.
  • Vulkan graphics interface support – OpenGL’s successor and DirectX 12’s competitor – for faster graphics and compute.
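Whether a given driver actually exposes the first two can be checked from the device’s OpenCL extension string; a minimal host-side sketch (assuming one platform and one GPU device, with error checking omitted for brevity):

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <cstring>

int main() {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);

    // CL_DEVICE_EXTENSIONS returns a space-separated list of extensions.
    char ext[8192] = {};
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, nullptr);
    std::printf("FP16: %s\n", std::strstr(ext, "cl_khr_fp16") ? "yes" : "no");
    std::printf("FP64: %s\n", std::strstr(ext, "cl_khr_fp64") ? "yes" : "no");
    return 0;
}
```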

Will these new features make upgrading your laptop to a brand-new KBL laptop more compelling?

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of the new Intel ULV APUs with the old versions.

Graphics Unit Haswell HD4000 Haswell HD5000 Broadwell HD6100 Skylake HD520 Skylake HD540 Kabylake HD620 Comment
Graphics Core EV7.5 HSW GT2U EV7.5 HSW GT3U EV8 BRW GT3U EV9 SKL GT2U EV9 SKL GT3eU EV9.5 KBL GT2U Despite 4 CPU generations we really have 2 GPU generations.
APU / Processor Core i5-4210U Core i7-4650U Core i7-5557U Core i7-6500U Core i5-6260U Core i3-7100U The naming convention has changed between generations.
Cores (CU) / Shaders (SP) / Type 20C / 160SP 40C / 320SP 48C / 384SP 24C / 192SP 48C / 384SP 23C / 184SP BRW increased CUs to 24/48 and i3 misses 1 core.
Speed (Min / Max / Turbo) MHz 200-1000 200-1100 300-1100 300-1000 300-950 300-1000 The turbo clocks have hardly changed between generations.
Power (TDP) W 15 15 28 15 15 15 Except GT3 BRW, all ULVs are 15W rated.
DirectX CS Support 11.1 11.1 11.1 11.2 / 12.1 11.2 / 12.1 11.2 / 12.1 SKL/KBL enjoy v11.2 and 12.1 support.
OpenGL CS Support 4.3 4.3 4.3 4.4 4.4 4.4 SKL/KBL provide v4.4 vs. version 4.3 for older devices.
OpenCL CS Support 1.2 1.2 1.2 2.0 2.0 2.1 SKL provides v2.0 support and KBL v2.1, vs v1.2 for older devices.
FP16 / FP64 Support No / No No / No No / No Yes / Yes Yes / Yes Yes / Yes SKL/KBL support both FP64 and FP16.
Byte / Integer Width 8 / 32-bit 8 / 32-bit 8 / 32-bit 128 / 128-bit 128 / 128-bit 128 / 128-bit SKL/KBL prefer vectorised integer workloads, 128-bit wide.
Float/ Double Width 32 / X-bit 32 / X-bit 32 / X-bit 32 / 64-bit 32 / 64-bit 32 / 64-bit Strangely neither arch prefers vectorised floating-point loads – driver bug?
Threads per CU 512 512 256 256 256 256 Strangely BRW and later reduced the threads/CU to 256.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader .

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 288 399 597 875 [+3x] 1500 840 [+2.8x] If FP16 is enough, KBL and SKL have 2x the performance of FP32.
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 299 375 614 468 [+56%] 817 452 [+50%] SKL GT3e rules the roost but KBL hardly improves on SKL.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 18.54 (eml) 24.4 (eml) 38.9 (eml) 112 [+6x] 193 104 [+5.6x] SKL GT2 with native FP64 is almost 3x emulated BRW GT3!
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.8 (eml) 2.36 (eml) 4.4 (eml) 6.34 (eml) [+3.5x] 10.92 (eml) 6.1 (eml) [+3.4x] Emulating FP128 through FP64 is ~2.5x faster than through FP32.
As expected native FP16 runs about 2x faster than FP32 and thus provides a huge performance upgrade if precision is sufficient. Native FP64 is about 8x emulated FP64 and even emulated FP128 improves by about 2.5x! Otherwise KBL GT2 matches SKL GT2 and is about 50% faster than HSW GT2 in FP32 and 6x faster in FP64.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 1.37 1.85 2.7 2.19 [+60%] 3.36  2.21 [+60%] Since BRW, integer performance has been similar.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1.87 2.45 3.45 2.79 [+50%] 4.3 2.83 [+50%] Not a lot changes here.
SKL/KBL GT2 with integer workloads (with extensive memory accesses) are 50-60% faster than HSW, similar to what we saw with floating-point performance. But the change happened with BRW, which improved the most over HSW – SKL and KBL do not improve further.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s)  1.2 1.62 4.35  3 [+2.5x] 5.12 2.92 In this tough compute test SKL/KBL are 2.5x faster.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2.86  3.93  9.82  6.7 [+2.34x]  11.26  6.49 With a lighter algorithm SKL/KBL are still ~2.4x faster.
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s)  0.828  1.08 1.68 1.08 [+30%] 1.85  1 64-bit integer performance does not improve much.
In pure integer compute tests SKL/KBL greatly improve over HSW, being no less than 2.5x faster – a huge improvement; but 64-bit integer performance hardly improves (30% faster with 20% more CUs, 24 vs 20). Again, BRW is where the improvements were added, with SKL GT3e hardly improving over BRW GT3.
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 461 495 493 656 [+42%]  772 618 [+40%] Pure FP32 compute SKL/KBL are 40% faster.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) 137  238 135 SKL GT3 is 73% faster than GT2 variants
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 62.45 85.76 123 86.32 [+38%]  145.6 82.8 [+35%] In this tough algorithm SKL/KBL are still almost 40% faster.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) 18.65 31.46 19 SKL GT3 is over 65% faster than GT2 KBL.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 106 160.4 192 174 [+64%] 295 166.4 [+56%] M/C is not as tough so here SKL/KBL are 60% faster.
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) 31.61 56 31 GT3 SKL manages an 80% improvement over GT2.
Intel is pulling our leg here; KBL GPU seems to show no improvement whatsoever over SKL, but both are about 40% faster in FP32 than the much older HSW. GT3 SKL variant shows good gains of 65-80% over the common GT2 and thus is the one to get if available. Obviously the ace card for SKL and KBL is FP64 support.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS)  117  130 142 116 [=]  181 113 [=] SKL/KBL have a problem with this algorithm, though GT3 does better?
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) 34.9 64.7 34.7 GT3 SKL manages an 86% improvement over GT2.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 13.3 13.1 15 20.53 [+54%]  27.3 21.9 [+64%] In a return to form SKL/KBL are 50% faster.
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) 5.2  4.19  4.69 GT3 stumbles a bit here; some optimisations are needed.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS)  122  157.9 249 201 [+64%]  304 177.6 [+45%] Here SKL/KBL are 50% faster overall.
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) 19.25 31.9 17.8 GT3 manages only a 65% improvement here.
Again we see no delta between SKL and KBL – the graphics cores perform the same; again both benefit from FP64 support allowing high precision kernels to run. GT3 SKL variant greatly improves over common GT2 variant – except in one test (DFFT) that seems to be an outlier.
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  341  432  636 492 [+44%]  641 488 [+43%] We see the GT3s trading blows in this integer test, but SKL/KBL are 40% faster than HSW.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  72.7  92.8  147  106 [+45%]  139  106 [+45%] BRW GT3 just wins this with SKL/KBL again 45% faster.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  75.6  96  152  110 [+45%]  149  111 [+45%] Another win for BRW and a 45% improvement for SKL/KBL.
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  72.6  90.6  147  105 [+44%]  143  105 [+44%] As above in this test.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s)  2.38  1.53  6.51  5.2 [+2.2x]  7.73  5.32 [+2.23x] SKL’s GT3 manages a win but overall SKL/KBL are over 2x faster than HSW.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  1.17  0.719  5.83  4.57 [+3.9x]  4.58  4.5 [+3.84x] Another win for BRW.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  511  688  1150  1100 [+2.1x]  1750  1080 [+2.05x] SKL/KBL are over 2x faster than HSW. BRW is beaten here.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s)  378.5  288  424  437 [+15%]  611  443 [+17%] Some wild results here; some optimisations may be needed.
In these integer workloads (with texture access) the 28W GT3 of BRW manages a few wins over the 15W GT3e of SKL – but compared to old HSW both SKL and KBL are between 40% and 300% faster. Again we see no delta between SKL and KBL – there does not seem to be any difference at all!

If you have a HSW GT2 then an upgrade to SKL GT2 brings massive improvements as well as FP16 and FP64 native support. But the HSW GT3 variant is competitive and BRW GT3 even more so. KBL GT2 shows no improvement over SKL GT2 – so it’s not just the CPU core that is unchanged but the graphics core also – it’s no EV9.5 here, more like EV9.1!

For integer workloads BRW is where the big improvement came, but for 64-bit integers that improvement is still to come, if ever. At least all drivers support native int64.

Transcoding Performance

We are testing media (video + audio) transcoding performance for common video algorithms: H.264/AVC (MP4) and H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
H.264/AVC Decoder/Encoder QuickSync H264 8-bit only QuickSync H264 8-bit only QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit HSW supports 8-bit only so 10-bit (high-colour) are out of luck.
H.265/HEVC Decoder/Encoder QuickSync H265 8-bit partial QuickSync H265 8-bit QuickSync H265 8-bit QuickSync H265 8/10-bit SKL has full/hardware H265/HEVC transcoding but for 8-bit only; Main10 (10-bit profile) requires KBL so finally we see a difference.
Transcode Benchmark VC 1 > H264/AVC Transcoding (MB/s)  7.55 8.4  7.42 [-2%]  8.25  8.08 [+6%] With DDR4 KBL is 6% faster.
Transcode Benchmark VC 1 > H265/HEVC Transcoding (MB/s)  0.734  3.14 [+4.2x]  3.67  3.63 [+5x] Hardware support makes SKL/KBL 4-5x faster.

If you want HEVC/H.265 – including 4K/UHD – then you want SKL. But if you plan on using 10-bit/HDR colour then you need KBL – finally an improvement over SKL. As the transcoder uses fixed-function hardware, the GT3 performs only slightly faster.

Memory Performance

We are testing memory performance of GPUs using OpenCL and DirectX/OpenGL ComputeShader, including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Memory Configuration 8GB DDR3 1.6GHz 128-bit 8GB DDR3 1.6GHz 128-bit 16GB DDR3 1.6GHz 128-bit 8GB DDR3 1.867GHz 128-bit 16GB DDR4 2.133GHz 128-bit 16GB DDR4 2.133GHz 128-bit All use 128-bit memory with SKL/KBL using DDR4.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 64 2048 / 64 2048 / 64 2048 / 64 Shared memory remains the same; in SKL/KBL constant memory is the same as global.
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 10.4 10.7 11 15.65 23 [+2.1x] 19.6 DDR4 seems to provide over 2x the bandwidth despite the modest clock increase.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 5.23 5.35 5.54 7.74 11.23 [+2.1x] 9.46 Again an over-2x increase in upload speed.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 5.27 5.36 5.29 7.42 11.31 [+2.1x] 9.6 Again an over-2x increase in download speed.
SKL/KBL + DDR4 provide an over-2x increase in internal, upload and download memory bandwidth – despite the relatively modest increase in memory speed (2133 vs 1600); with DDR3 1867MHz memory the improvement drops to 1.5x. So if you were deciding between DDR3 and DDR4, the choice has been made!
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns)  179 192  234 [+30%]  296 235 [+30%] With DDR4, latency has increased by 30% – not great.
GPGPU Memory Latency Constant Memory Latency (ns)  92.5  112  234 [+2.53x]  279  235 [+2.53x] Constant memory has effectively been dropped, resulting in disastrous 2.5x higher latencies.
GPGPU Memory Latency Shared Memory Latency (ns)  80  84  –  86.8 [+8%]  102  84.6 [+8%] Shared memory latency has stayed about the same.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns)  283  298  56 [1/5x]  58.1 [1/5x] Texture access seems to have markedly improved – about 5x faster.
SKL/KBL global memory latencies have increased by 30% with DDR4 – thus wiping out some of the gains. The “new” constant memory (2GB!) is now really just bog-standard global memory and thus comes with an over-2x increase in latency. Shared memory latency has stayed pretty much the same. Texture memory access is very much faster – 5x faster, likely through some driver optimisations.

Again no delta between KBL and SKL; if you want bandwidth (who doesn’t?) DDR4 with modest 2133MHz memory doubles bandwidths – but latencies increase. Constant memory is now the same as global memory and does not seem any faster.

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL as well as memory bandwidth performance.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 250  119 602 [+2.4x] 1000 537 [+2.1x] FP16 support in DirectX doubles performance.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 235  109 338 [+43%]  496 289 [+23%] FP16 does not yet work in OpenGL.
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s)  238  120 276 [+16%]  485 248 [+4%] We only see a measly 4-16% better performance here.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 228  108 338 [+48%] 498 289 [+26%] SKL does better here – it’s 50% faster than HSW.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 52.4  78 76.7 [+46%] 133 69 [+30%] With FP64 SKL is still 45% faster.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 63.2  67.2 105 [+60%] 177 96 [+50%] Similar result here 50-60% faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 5.2  7 18.2 [+3.5x] 31.3 16.7 [+3.2x] Driver optimisation makes SKL/KBL over 3.5x faster.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 5.55  7.5 57.5 [+10x]  97.7 52.3 [+9.4x] Here we see SKL/KBL over 10x faster!
We see similar results to OpenCL GPGPU here – with FP16 doubling performance in DirectX – but with FP64 already supported in both DirectX and OpenGL even with HSW, KBL and SKL have less of a lead – of around 50%.
Video Memory Benchmark Internal Memory Bandwidth (GB/s)  15  14.8 27.6 [+84%] 26.9 25 [+67%] DDR4 brings up to ~80% more bandwidth.
Video Memory Benchmark Upload Bandwidth (GB/s)  7  7.8 10.1 [+44%] 12.34 10.54 [+50%] Upload bandwidth has also increased ~50%.
Video Memory Benchmark Download Bandwidth (GB/s)  3.63  3.3 3.53 [-2%] 5.66 3.51 [-3%] No change in download bandwidth though.

Final Thoughts / Conclusions

SKL and KBL with the 21.45 driver yield significant gains in OpenCL, making an upgrade from HSW and even BRW quite compelling – despite the relatively recent 20.40 driver Intel was forced to provide for Windows 10. The GT3 version provides good gains over the standard GT2 version and should always be selected if available.

Native FP64 support is a huge addition which provides support for high-precision kernels – unheard of for integrated graphics. Native FP16 support provides an additional 2x performance in cases where 16-bit floating-point processing is sufficient.

However KBL’s EV9.5 graphics core shows no improvement at all over SKL’s EV9 core – thus it’s not just the CPU core that has not been changed but the GPU core too! Except for the updated transcoder supporting Main10 HEVC/H.265 (thus HDR / 10-bit+ colour) which is still quite useful for UHD/4K HDR media.

This is very much a surprise: while the CPU core has not improved markedly since SNB (Core v2), the GPU core has always provided significant improvements – and now we have hit the same road-block, even as dedicated GPUs continue to improve significantly in performance and power efficiency. This marks the smallest generation-to-generation improvement ever – SKL to KBL; effectively KBL is a SKL refresh.

It seems the rumour that Intel may switch to ATI/AMD graphics cores may not be such a crazy idea after all!

AMD Ryzen Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017) replacing the previous “Vishera”/”Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later and likely include a few improvements as well.

This is the “make or break” CPU for AMD, and it greatly improves performance, including much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). Thus there are no longer “core modules” but proper “cores with 2 SMT threads”, so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

The desktop version uses the AM4 socket and requires new boards based on various south-bridges with names similar to Intel’s aka X370 (performance), B350 (mainstream), etc.

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Ryzen 1700X Intel 6700K (Skylake) Intel 5820K (Haswell-E) Comments
TLB 4kB pages 64 full-way + 1536 8-way 64 8-way + 1536 6-way 64 4-way + 1024 8-way Ryzen has comparatively ‘better’ TLBs than even SKL while the 2-generation older HSW-E is showing its age.
TLB 2MB pages 64 full-way + 1536 2-way 8 full-way + 1536 6-way 8 full-way + 1024 8-way Nothing much changes for 2MB pages with Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-4000 1200-4000 Ryzen’s memory controller runs at memory clock (MCLK) base rate and thus depends on the memory installed. Intel’s UNC (uncore) runs between min and max CPU clock, thus perhaps faster.
Memory Speed (MHz) Max 2400 / 2666 2533 / 2400 2133 / 2133 Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL officially supports only up to 2400 but happily runs at 2533MHz; old HSW-E can only do 2133MHz but with 4 memory channels.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 4 / 256-bit HSW-E leads with 4 memory channels of DDR4 providing massive bandwidth for its cores; however both Ryzen and Skylake can use faster DDR4 memory, reducing this problem somewhat.
Memory Timing (clocks) 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-15-15-36 4-51-16-3 2T Despite faster memory Ryzen can run lower timings than HSW-E and SKL, reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX ensures lower latencies and higher bandwidth, which we will test presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
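The producer/consumer measurement itself boils down to two threads ping-ponging a cache line while pinned to chosen units. A simplified sketch of the idea (our own illustration, not Sandra’s implementation – the unit numbers are placeholders; pin to two cores in the same CCX, then in different CCXes, and compare the round-trip times):

```cpp
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<int> token{0};
static const int kIters = 200000;

// Pin the calling thread to one logical processor.
static void pin_to(unsigned cpu) {
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu);
}

int main() {
    // Consumer on logical unit 8 (illustrative); producer on unit 0.
    std::thread consumer([] {
        pin_to(8);
        for (int i = 0; i < kIters; ++i) {
            while (token.load(std::memory_order_acquire) != 1) {} // wait ping
            token.store(0, std::memory_order_release);            // pong
        }
    });
    pin_to(0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        token.store(1, std::memory_order_release);                // ping
        while (token.load(std::memory_order_acquire) != 0) {}     // wait pong
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    consumer.join();
    std::printf("round-trip: %lld ns\n", (long long)(ns / kIters));
    return 0;
}
```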

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT) / 8C/8T (MC) i7-6700K 4C/8T (MT) / 4C/4T (MC) i7-5820K 6C/12T (MT) / 6C/6T (MC) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 47.7 [+3%] 39 46 With 8 cores (and thus 8 pairs) Ryzen’s bandwidth matches the 6-core HSW-E and is 22% higher than SKL thus decent.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 13 [-23%] 16 17 In the worst case, pairs on Ryzen must go across CCXes, while on Intel CPUs they can still use the L3 cache to exchange data – thus Ryzen ends up about 23% slower than both SKL and HSW-E, but it is not catastrophic.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.7 [-2%] 16 13.4 Within the same core (sharing L1D/L2), Ryzen inter-unit latency is ~15ns, comparable with both Intel CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 45 [-8%] 49 58 Within the same compute unit (sharing L3), the latency is ~45ns – a bit lower than SKL’s and much lower than HSW-E’s – so far so good!
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 [+3x] Going inter-CCX increases the latency 3 times, to about 130ns; thus if the threads are not properly scheduled you can end up with far higher latency and far less bandwidth.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled to avoid inter-CCX transfers where possible. As the CCX link runs at UMC speed using faster memory increases link bandwidth and decreases its latency which helps no end.
Aggregated L1D Bandwidth (GB/s) 727 [-17%] 878 1150 SKL has 512-bit data ports so Ryzen cannot compete with that, but it does well against HSW-E.
Aggregated L2 Bandwidth (GB/s) 557 [+38%] 402 500 The 8 L2 caches have competitive bandwidth, so overall Ryzen leads, though HSW-E does well.
Aggregated L3 Bandwidth (GB/s) 392 [+58%] 247 205 Even spread over the 2 CCXes, the L3 caches have huge aggregated bandwidth – almost 60% over SKL.
Aggregated Memory (GB/s) 28.5 [-8%] 31 42.5 Running at lower memory speed Ryzen cannot beat SKL, nor HSW-E with its 4 memory channels, but it has higher comparative efficiency.
The 8x L2 caches and 2x L3 caches have much higher aggregated bandwidth than either Intel CPU, while the memory controller is also efficient though it cannot compete with the 4-channel HSW-E. But its 8x L1D caches are not “wide enough” to compete with SKL’s widened data ports (first widened in HSW). This may be one reason SIMD performance is not as high on Ryzen, and AMD may have to widen them going forward, especially when adding AVX512 support.
Data In-Page Random Latency (ns) 74 [+2.9x] (4-17-36) 20 (4-12-21) 25.3 (4-12-26) In-page latency is surprisingly large – almost 3x old HSW-E and ~4x SKL! Ryzen’s TLBs seem slow. L1 and L2 latencies are comparable (4 and 17 clk) but L3 latency is already ~50% higher than HSW-E’s and almost 2x SKL’s.
Data Full Random Latency (ns) 95 [+31%] (4-17-37) 65 (4-12-34) 72 (4-13-52) Out-of-page latencies are ‘better’, with Ryzen ‘only’ ~30% slower than HSW-E and about 50% slower than SKL. Again L1 and L2 are fine (4 and 17 clk) and L3 is comparable to SKL’s (37 vs 34 clk) while old HSW-E trails (52 clk)!
Data Sequential Latency (ns) 4.2 [+1%] (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency at ~4ns matching SKL and much better than old HSW-E.
We finally discover an issue – Ryzen’s memory latencies (in-page) are very high compared to the competition – a TLB issue? Fortunately sequential and out-of-page performance is fine, so perhaps its memory prefetchers can alleviate the problem somewhat, but it is something that will need to be addressed.
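For reference, ‘in-page random’ latency is classically measured by pointer-chasing: each load’s address depends on the previous load, defeating out-of-order execution, with the random order confined to page-sized windows to keep TLB pressure low. A minimal sketch of the technique (our own illustration, not Sandra’s implementation; buffer and step counts are arbitrary):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // 64k pages x 512 8-byte entries = 256MB chase buffer.
    const std::size_t PAGE = 4096 / sizeof(std::size_t);
    const std::size_t PAGES = 64 * 1024;
    std::vector<std::size_t> chain(PAGE * PAGES);
    std::mt19937_64 rng(42);

    // Within each 4kB page, visit all entries in random order; the last
    // entry links to the next page, forming one big cycle ("in-page random").
    for (std::size_t pg = 0; pg < PAGES; ++pg) {
        std::vector<std::size_t> order(PAGE);
        std::iota(order.begin(), order.end(), pg * PAGE);
        std::shuffle(order.begin() + 1, order.end(), rng); // keep start fixed
        for (std::size_t i = 0; i + 1 < PAGE; ++i)
            chain[order[i]] = order[i + 1];
        chain[order[PAGE - 1]] = ((pg + 1) % PAGES) * PAGE;
    }

    const std::size_t steps = 50000000;
    std::size_t p = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) p = chain[p]; // dependent loads
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("%.2f ns/access (end=%zu)\n", double(ns) / steps, p);
    return 0;
}
```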
Code In-Page Random Latency (ns) 16.6 [+5%] (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code we don’t see the same problem – in-page latency matches HSW-E, though it is still about 50% higher than SKL. The twice-as-large (64kB) L1I cache seems to have the same (4clk) latency as SKL’s. No issues with L2 nor L3 latencies either.
Code Full Random Latency (ns) 102 [+20%] (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page latency is a bit higher than both Intel CPUs, ~50% higher than SKL but nothing as bad as we’ve seen with data.
Code Sequential Latency (ns) 8.9 [-12%] (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns comparative to HSW-E but again about ~67% higher than SKL.
While code access latencies are higher than the new SKL’s – they are comparable with the older HSW-E’s and nowhere near as bad as what we’ve seen with data. Even the twice-as-large L1I (L1 instruction cache) behaves itself, with a 4clk latency similar to Intel’s smaller L1I versions. It is thus a mystery why data is affected but not code.
Memory Update Transactional (MTPS) 4.23 [-39%] 32.4 HLE 7 SKL is in a world of its own due to its support for HLE/RTM, but Ryzen is still about 40% slower than the 6-core HSW-E.
Memory Update Record Only (MTPS) 4.19 [-23%] 25.4 HLE 5.47 With only record updates the difference drops to about 20% but again HLE shows its power for transaction processing.
Without HLE/RTM Ryzen is not going to win against Intel’s latest – but then again HLE/RTM are disabled in all but the very top-end CPUs – not to mention killed off in the previous HSW and BRW architectures – so it is not a big problem. But if future Ryzen models were to support such extensions, Intel would have a big problem on its hands…
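For reference, RTM exposes hardware transactions directly through compiler intrinsics; a minimal sketch of a transactional update with a spin-lock fallback (our own illustration, not Sandra’s harness – it assumes RTM support has already been verified via CPUID, and needs -mrtm on GCC/Clang):

```cpp
#include <immintrin.h>
#include <atomic>

static std::atomic<int> lock_flag{0};
static long counter = 0;

static void lock_acquire() {
    while (lock_flag.exchange(1, std::memory_order_acquire)) _mm_pause();
}
static void lock_release() { lock_flag.store(0, std::memory_order_release); }

// Transactional update with a spin-lock fallback. The transaction reads the
// lock word so it aborts if another thread holds the lock - the standard
// lock-elision pattern HLE automates.
void record_update() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (lock_flag.load(std::memory_order_relaxed) == 0) {
            ++counter;          // speculative: commits only if no conflict
            _xend();
            return;
        }
        _xabort(0xFF);          // lock held: abort, take the fallback path
    }
    lock_acquire();             // reached on abort or if RTM refuses to start
    ++counter;
    lock_release();
}
```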

Ryzen’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals partly due to more cores and more caches (8 vs 6 or 4); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies which are far higher – about 3 times – TLB issues? We’ve been here before with Bulldozer which could not be easily fixed – but if AMD does manage it this time Ryzen’s performance will literally fly!

Still, despite this issue we’ve seen in the previous article that Ryzen’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen’s memory performance is not the clean sweep we’ve seen in CPU testing but it is competitive with Intel’s designs, especially the older HSW (and thus BRW) cores, while the newer SKL (and thus KBL) cores sport improved caches and TLBs which are hard for Ryzen to beat. Still, it’s nothing to be worried about, and perhaps AMD will be able to improve things further with microcode/firmware updates, if not new steppings and models (e.g. the APU model 10).

Overall we’d still recommend Ryzen over Intel CPUs – unless you want an absolutely tried-and-tested design which has already been patched with microcode and firmware/BIOS updates. The platform has a bright future, with more CPUs destined to use the AM4 socket, while both the 1151 (SKL/KBL) and 2011 (HSW-E/BRW-E) platforms are due to be replaced again with no future upgrades.

AMD Ryzen Review & Benchmarks – CPU 8-core Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017) replacing the previous “Vishera”/”Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later and likely include a few improvements as well.

This is the “make or break” CPU for AMD, and it greatly improves performance, including much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). Thus there are no longer “core modules” but proper “cores with 2 SMT threads”, so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

The desktop version uses the AM4 socket and requires new boards based on various south-bridges with names similar to Intel’s aka X370 (performance), B350 (mainstream), etc.

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design.

Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Ryzen 1700X
Intel 6700K (Skylake)
Intel 5820K (Haswell-E) Comments
Cores (CU) / Threads (SP) 8C / 16T 4C / 8T 6C / 12T Ryzen has the most cores and threads – so it will be down to IPC and clock speeds. But if it’s threads you want Ryzen delivers.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x)  0.8-4.0-4.2GHz (8x-40x-42x)  1.2-3.3-4.0GHz (12x-33x-40x) SKL has the highest rated speed @ 4GHz but all three have comparable Turbo clocks, so it depends on how long they can sustain them.
Power (TDP) 95W 91W 140W Ryzen has a comparable TDP to SKL while HSW-E’s is almost 50% higher.
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 6x 32kB 8-way / 6x 32kB 2-way Ryzen’s instruction cache is 2x the size of its data cache – a somewhat strange decision; all caches are 8-way except HSW-E’s L1I (2-way).
L2 Caches 8x 512kB 8-way 4x 256kB 8-way 6x 256kB 8-way Ryzen’s L2 is 2x as big as either Intel CPU’s, which should help quite a bit, though it is still 8-way.
L3 Caches 2x 8MB 16-way 8MB 16-way 15MB 20-way With 2x as many cores/threads as SKL, Ryzen has two 8MB L3 caches, one for each CCX.

Thread Scheduling and Windows

Ryzen’s topology (4 cores in each of 2 CCXes (compute clusters)) makes it akin to the old Core 2 Quad or Pentium D (2 dies on 1 socket) – effectively an SMP (dual-CPU) system on a single socket. Windows has always tended to migrate running threads from unit to unit in order to equalise thermal dissipation, though Windows 10/Server 2016 have increased the ‘stickiness’ of threads to units.

As the Windows scheduler is intertwined with the power management system, under ‘Balanced‘ and other power-saving profiles unused cores are ‘parked’ (aka powered down), which affects which cores are available for scheduling. AMD has recommended the ‘High Performance‘ profile, as well as initially claiming the Windows scheduler is not ‘Ryzen-aware’ before retracting the statement.

However, there does seem to be a problem: in Sandra tests using fewer than the total 16 threads (e.g. the MC test with 8 threads), where Sandra does not hard-schedule threads using its own scheduler (e.g. the .Net and Java benchmarks), the scheduling does not appear optimal:

Ryzen Hard Affinity (e.g. Native) vs. Ryzen No Affinity (e.g. Java/.Net)

While in the left image we see Sandra at work assigning the 8 threads to 8 different cores – with 100% utilisation on those units and almost nothing on the other 8 – in the right image we see 10 units (!) in use, 4 not used at all, and only ~50% utilisation.

This does not seem to happen on Intel hardware – even SMP systems – thus it may be something to be adjusted in future Windows versions.
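Hard scheduling in the style of the left image amounts to pinning one worker per physical core. A sketch of the idea (our own illustration – it assumes, as is typical on Windows, that logical processors 2n and 2n+1 are the SMT siblings of core n):

```cpp
#include <windows.h>
#include <thread>
#include <vector>

int main() {
    const unsigned cores = 8;  // 8C/16T Ryzen in MC (8-thread) mode
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c) {
        workers.emplace_back([c] {
            // Pin to the first logical processor of physical core c,
            // leaving the SMT sibling (2c+1) idle.
            SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << (2 * c));
            volatile unsigned long long x = 0;   // placeholder busy work
            for (unsigned long long i = 0; i < 400000000ULL; ++i) x += i;
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}
```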

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT) / 8C/8T (MC) i7-6700K 4C/8T (MT) / 4C/4T (MC) i7-5820K 6C/12T (MT) / 6C/6T (MC) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 290 [+24%] | 242 [+13%] AVX2 185 | 146 233 | 213 Right off the bat Ryzen beats both Intel CPUs in both MT and MC tests with SMT providing a good gain (hard scheduled of course).
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 292 [+27%] | 260 [+22%] AVX2 185 | 146 230 | 213 With a 64-bit integer workload nothing much changes, Ryzen still beats both in both tests – 27% faster than HSW-E! AMD has ri-sen from the ashes like the Phoenix!
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 185 [+23%] | 123 [+23%] AVX/FMA 109 | 74 150 | 100 Even in this floating-point test, Ryzen beats both again by a similar margin, 23% better than HSW-E. What performance for the money!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 155 [+33%] | 102 [+32%] AVX/FMA 89 | 60 116 | 77 With FP64 the winning streak continues, with the difference increasing to 33% over HSW-E a huge gain.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, Ryzen rules the roost, blowing both SKL and HSW-E away, being between 23-33% faster – with or without SMT. SMT also yields a bigger gain than on Intel’s designs.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 535 [-16%] | 421 [-13%] AVX2 513 | 389 639 | 485 In this vectorised AVX2 integer test Ryzen just overtakes SKL but cannot beat HSW-E and is just 16% slower; still it is a good result but it shows Intel’s SIMD units are really strong with AMD’s 8 cores matching Intel’s 4 cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 159 [-16%] | 137 [-18%] AVX2 191 | 158 191 | 168 With a 64-bit AVX2 integer vectorised workload again Ryzen is unable to beat either Intel CPU being slower by a similar margin -16%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 3.61 [+30%] | 2.1 [+11%] 2.15 | 1.36 2.74 | 1.88 This is a tough test using Long integers to emulate Int128 without SIMD and here Ryzen comes back on top being 30% faster similar to what we saw in Dhrystone.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 530 [-11%] | 424 [-4%] FMA 479 | 332 601 | 440 In this floating-point AVX/FMA vectorised test we see again the power of Intel’s SIMD units, with Ryzen being only 11% slower than HSW-E but beating SKL.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 300 [-13%] | 247 [=] FMA 271 | 189 345 | 248 Switching to FP64 SIMD code, again Ryzen cannot beat HSW-E but does beat SKL which should be sufficient.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 13.7 [+14%] | 9.7 [+2%] FMA 10.7 | 7.5 12 | 9.5 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen manages to beat both CPUs being 14% faster. So AVX2 or FMA code is not a problem.
In vectorised AVX2/FMA code we see Ryzen lose for the first time to Intel’s SIMD units, but not by a large margin; in non-vectorised code, as with Dhrystone and Whetstone, Ryzen is again quite a bit faster than either Intel CPU. Overall Ryzen would be the preferred choice unless running heavily vectorised number-crunching code.
BenchCrypt Crypto AES-256 (GB/s) 13.8 [-31%] | 14 [-32%] AES 15 | 15.4 20 | 20.7 All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and 2 memory channels is just not enough; with its 4 channels HSW-E is unbeatable for streaming tests. But Ryzen is only marginally slower than its counterpart SKL.
BenchCrypt Crypto AES-128 (GB/s) 13.9 [-31%] | 14 [-33%] AES 15 | 15.4 20.1 | 21.2 What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to beat HSW-E but at least is marginally slower than SKL.
BenchCrypt Crypto SHA2-256 (GB/s) 17.1 [+2.25x] | 10.6 [+49%] SHA 5.9 | 5.5 AVX2 7.6 | 7.1 AVX2 Ryzen’s secret weapon is revealed: by supporting SHA HWA it soundly beats both Intel CPUs even running multi-buffer vectorised AVX2 code – it’s 2.2x faster! Surprisingly disabling SMT (MC mode) reduces performance appreciably, not what would be expected.
BenchCrypt Crypto SHA1 (GB/s) 17.3 [+14%] | 11.4 [-14%] SHA 11.3 | 10.6 AVX2 15.1 | 13.3 AVX2 Ryzen also accelerates the soon-to-be-defunct SHA1, but the AVX2 implementation is much less complex, allowing HSW-E to come within a whisker of Ryzen and beat it in MC mode by a similar 14% margin. Still, much better to have SHA HWA than to find multiple buffers to process with AVX2.
BenchCrypt Crypto SHA2-512 (GB/s) 3.34 [-37%] | 3.32 [-36%] AVX2 4.4 | 4.2 5.34 | 5.2 SHA2-512 is not accelerated by SHA HWA (version 1), thus Ryzen has to use the same vectorised AVX2 code path, where Intel's SIMD units show their power again.
Ryzen's secret crypto weapon is support for SHA HWA (which Intel currently supports only on Atom), allowing it to beat both Intel CPUs; a simple run-time dispatch decides which code path gets used (see the sketch below). For streaming algorithms like encrypt/decrypt it would probably benefit from more memory channels to feed all those cores. But overall it would still be the top choice.
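For reference, this is roughly how a benchmark (or crypto library) picks the SHA HWA path over the multi-buffer AVX2 fallback at run time: the SHA extensions are reported in CPUID leaf 7, EBX bit 29, AVX2 in bit 5. A minimal sketch (GCC/Clang shown; MSVC offers __cpuidex() for the same query; illustrative, not Sandra's actual code):

```cpp
#include <cpuid.h>   // GCC/Clang intrinsic wrapper for the CPUID instruction
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // CPUID leaf 7, sub-leaf 0 reports extended feature flags in EBX.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        bool sha  = ebx & (1u << 29);  // SHA extensions: Ryzen yes, SKL/HSW-E no
        bool avx2 = ebx & (1u << 5);   // AVX2: all three CPUs tested here
        if (sha)       std::puts("dispatch: SHA hardware-accelerated path");
        else if (avx2) std::puts("dispatch: multi-buffer AVX2 path");
        else           std::puts("dispatch: scalar path");
    }
    return 0;
}
```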
BenchFinance Black-Scholes float/FP32 (MOPT/s) 234 [+49%] | 166 [+36%] 129 | 97 157 | 122 In this non-vectorised test we see Ryzen shine brightly again, beating even HSW-E by 49% – an incredible result. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s) 198 [+51%] | 139 [+39%] 108 | 83 131 | 100 Switching to FP64 code, Ryzen still shines, beating HSW-E by 51% again and totally demolishing SKL. So far so great!
BenchFinance Binomial float/FP32 (kOPT/s) 85.1 [+2.25x] | 83.2 [+3.23x] 27.2 | 18.1 37.8 | 25.7 Binomial uses thread-shared data and thus stresses the cache and memory system; we would expect Ryzen to falter here, but nothing of the sort – it grinds both Intel CPUs into the dust, running 2.25x faster than HSW-E (the sharing pattern is illustrated below)! Even a 12-core HSW-E would not be enough to catch it.
BenchFinance Binomial double/FP64 (kOPT/s) 45.8 [+37%] | 46.3 [+38%] 25.5 | 24.6 33.3 | 33.5 With FP64 code the situation changes somewhat, with Ryzen 'only' 37% faster than HSW-E – but still an appreciable win. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 49.2 [+55%] | 41.2 [+52%] 25.9 | 21.9 31.6 | 27 Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches; Ryzen reigns supreme here also, being 55% faster than even HSW-E. SKL is left in the dust.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 37.3 [+75%] | 31.8 [+41%] 19.1 | 17.2 21.2 | 22.5 Switching to FP64, Ryzen increases its dominance to 75% over HSW-E while destroying SKL completely.
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point, Ryzen reigns supreme, beating even the 6-core Haswell-E by such a margin that even a 12-core HSW-E may not catch it. For financial workloads there is only one choice: Ryzen!!! Long live the new king!
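Why does read/write thread-shared data (Binomial) stress the cache system while read-only sharing (Monte-Carlo) does not? Writes to a cache line used by several cores force that line to bounce between their caches. A hedged illustration of the contention pattern (our own toy example, not Sandra's code):

```cpp
#include <atomic>
#include <thread>

struct Shared {
    std::atomic<long> a{0};  // a and b will typically sit in the same
    std::atomic<long> b{0};  // 64-byte cache line, so two writers on
};                           // different cores keep stealing it back

int main() {
    Shared s;
    std::thread t1([&]{ for (int i = 0; i < 10000000; ++i)
        s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&]{ for (int i = 0; i < 10000000; ++i)
        s.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    // Padding each counter to its own line (alignas(64)) largely removes
    // the slowdown; time both variants to see the ping-pong cost directly.
    return 0;
}
```

Read-only shared data, by contrast, can be replicated in every core's cache simultaneously, so no such ping-pong occurs.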
BenchScience SGEMM (GFLOPS) float/FP32 68.3 [-63%] | 155 [-27%] FMA 109 | 162 185 | 213 In this tough vectorised AVX2/FMA algorithm Ryzen falters and gets soundly beaten by both SKL and HSW-E. Again the powerful SIMD units of Intel's CPUs finally allow them to beat it, as we've seen in previous tests. This is its Achilles' heel.
BenchScience DGEMM (GFLOPS) double/FP64 62.7 [-28%] | 78.4 [-23%] FMA 72 | 67.8 87.7 | 103 With FP64 vectorised code the gap narrows, with Ryzen 28% slower than HSW-E and only a bit slower than SKL. Again, vectorised SIMD code is problematic.
BenchScience SFFT (GFLOPS) float/FP32 8.9 [-50%] | 9.85 [-39%] FMA 18.9 | 15 18 | 16.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here Ryzen is again much slower than both SKL and HSW-E. For vectorised code it seems Ryzen would need twice the SIMD units to match Intel.
BenchScience DFFT (GFLOPS) double/FP64 7.5 [-31%] | 7.3 [-30%] FMA 9.3 | 9 10.9 | 10.5 With FP64 code Ryzen does improve (or Intel gets slower), being only 30% slower than HSW-E and about 20% slower than SKL.
BenchScience SNBODY (GFLOPS) float/FP32 234 [-15%] | 225 [-16%] FMA 273 | 271 158 | 158 N-Body simulation is vectorised but makes many memory accesses to shared data, and here SKL does unusually well, beating second-placed Ryzen – but only by 15%. Strangely, HSW-E does badly in this test even with 6 cores.
BenchScience DNBODY (GFLOPS) double/FP64 87 [+10%] | 87 FMA 79 | 79 40 | 40 With FP64 code Ryzen improves, beating its SKL rival by 10%; again HSW-E does pretty badly in this test.
With highly vectorised SIMD code Ryzen again loses, though not by a lot; Intel has just one refuge – highly vectorised SIMD algorithms (see the FMA sketch below) that let its powerful SIMD units shine. Everything else is dominated by Ryzen.
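For reference, this is the kind of inner loop the GEMM/FFT tests lean on: back-to-back fused multiply-adds, where Intel's twin 256-bit FMA units per core pay off. A minimal SAXPY-style sketch (illustrative, not Sandra's code; compile with -mfma or /arch:AVX2):

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] = a*x[i] + y[i] for n a multiple of 8 floats.
// Each _mm256_fmadd_ps performs 8 multiply-adds (16 FLOPs) in one op.
void saxpy_fma(float a, const float* x, float* y, std::size_t n) {
    __m256 va = _mm256_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);   // fused multiply-add, one rounding
        _mm256_storeu_ps(y + i, vy);
    }
}
```

A core that can issue two such FMAs per cycle sustains twice the FLOPS of one that issues one, which is broadly why Intel pulls ahead in these tests.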
CPU Image Processing Blur (3×3) Filter (MPix/s) 750 [-1%] | 699 [+4%] AVX2 655 | 563 760 | 668 In this vectorised integer AVX2 workload Ryzen ties with HSW-E, a good result considering we saw it lose in similar algorithms.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 316 [-8%] | 301 AVX2 285 | 258 345 | 327 The same algorithm with more shared data sees Ryzen now 8% slower than HSW-E but still beating SKL.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 172 [-8%] | 166 AVX2 151 | 141 188 | 182 Again the same algorithm with even more shared data changes nothing: Ryzen is again 8% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 292 [-7%] | 279 AVX2 271 | 242 316 | 276 A different algorithm, but still an AVX2 vectorised workload, sees Ryzen about 7% slower than HSW-E but again faster than SKL.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 58.5 [+16%] | 37.4 AVX2 35.4 | 26.4 50.3 | 37 Still AVX2 vectorised code, but here Ryzen manages to beat even HSW-E by 16%. Thus it is not a given that it will lose all such tests – it just depends.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.6 [+26%] | 5.2 6.3 | 4.2 7.6 | 5.5 This test is not vectorised, though it uses SIMD instructions, and here Ryzen manages a 26% win even over HSW-E while leaving SKL in the dust.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 852 [+50%] | 525 422 | 297 571 | 420 Again in a non-vectorised test Ryzen just flies: it's 2x faster than SKL and no less than 50% faster than HSW-E (see the xorshift sketch after this table)! Intel does not have its way all the time – unless the code is highly vectorised!
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 147 [+47%] | 101 75 | 55 101 | 77 In this final non-vectorised test Ryzen really flies: it's again almost 2x faster than SKL and almost 50% faster than HSW-E! Intel must be getting desperate for SIMD-vectorised versions of these algorithms by now…
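As promised above, a minimal xorshift sketch – the classic 32-bit Marsaglia 13/17/5 variant (Sandra's exact parameters may differ) – shows why such generators resist easy vectorisation: each output feeds the next, a serial dependency chain that rewards raw per-core speed rather than SIMD width:

```cpp
#include <cstdint>

// Classic 32-bit xorshift PRNG (Marsaglia's 13/17/5 triple).
static inline std::uint32_t xorshift32(std::uint32_t& state) {
    std::uint32_t x = state;
    x ^= x << 13;   // every step consumes the previous result,
    x ^= x >> 17;   // so a single stream cannot be spread across
    x ^= x << 5;    // SIMD lanes - it is a serial dependency chain
    return state = x;
}
```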

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA), Ryzen does extremely well, beating both Skylake 4C and even Haswell-E 6C in all workloads except highly vectorised SIMD code, where Intel's powerful SIMD units can shine. Overall it would still be the choice for most workloads, except the somewhat specialised SIMD number-crunching tasks.

While we've not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) more memory bandwidth to feed all 16 threads would not go amiss – but the difference may not justify the increased cost, as we see with Intel's socket 2011 platform and HSW-E.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. .Net and Java. With operating systems – like Windows 10 – favouring SVM applications over "legacy" native ones, the performance of the .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.6.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks Ryzen 1700X 8C/16T (MT) | 8C/8T (MC) i7-6700K 4C/8T (MT) | 4C/4T (MC) i7-5820K 6C/12T (MT) | 6C/6T (MC) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 36.5 [+18%] | 25 23.3 | 17.2 30.7 | 26.8 .Net CLR integer performance starts off well with 18% better performance even over HSW-E, which itself does not do much better than SKL despite its extra cores.
BenchDotNetAA .Net Dhrystone Long (GIPS) 45.1 [+60%] | 26 23.6 | 21.6 28.2 | 25 Ryzen seems to greatly favour 64-bit integer workloads: here it is 60% faster than even HSW-E and almost 2x faster than SKL. All CPUs perform better with 64-bit workloads.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 100.6 [+53%] | 53 47.4 | 21.4 65.4 | 39.4 Floating-point CLR performance is pretty spectacular, with Ryzen beating HSW-E by over 50% – a pretty incredible result. Native or CLR code, it all runs great on Ryzen.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 121.3 [+41%] | 62 63.6 | 37.5 85.7 | 53.4 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen just over 40% faster than HSW-E.
It’s pretty incredible, for .Net applications Ryzen is king – no point buying Intel’s 2011 platform – buy Ryzen! With more and more applications (apps?) running under the CLR, Ryzen has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 92.6 [+22%] | 49 55.7 | 37.5 75.4 | 49.9 Just as we saw with Dhrystone, this integer workload sees a 22% improvement for Ryzen. While RyuJit supports SIMD integer vectors, the lack of bitfield instructions makes it slower for our code; a shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 97.8 [+23%] | 51 60.3 | 39.5 79.2 | 53.1 With a 64-bit integer workload we see a similar story – Ryzen is 23% faster than even HSW-E. If only RyuJit's SIMD support would cover integer workloads too.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 272.7 [-4%] | 156 AVX 12.9 | 6.74 284.2 | 187.1 Here we make use of RyuJit's support for SIMD vectors, thus running AVX/FMA code; Intel strikes back through its SIMD units, with Ryzen 4% slower than HSW-E. Intel usually wins these kinds of tests.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 149 [-15%] | 85 AVX 38.7 | 21.38 176.1 | 103.3 Switching to FP64 SIMD vector code – still running AVX/FMA – Ryzen loses again, this time by 15% against HSW-E.
The only tests Intel's CPUs can win are the vectorised ones using RyuJit's SIMD support (i.e. SSE2, AVX/FMA), letting Intel's SIMD units shine; otherwise Ryzen dominates absolutely everything without fail.
Java Arithmetic Java Dhrystone Integer (GIPS)  513 [+29%] | 311  313 | 289  395 | 321 We start JVM integer performance with an even bigger gain: Ryzen is ~30% faster than HSW-E and over 60% faster than SKL.
Java Arithmetic Java Dhrystone Long (GIPS) 514 [+28%] | 311 332 | 299 399 | 367 Nothing much changes with 64-bit integer workload, we have Ryzen 28% faster than HSW-E.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 117 [+31%] | 66 62.8 | 34.6 89 | 49 With a floating-point workload Ryzen continues its lead over both Intel CPUs. Native, CLR or JVM – code just runs great on Ryzen.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 128 [+40%] | 63 64.6 | 36 91 | 53 With FP64 workload the gap increases even further to 40% over HSW-E and an incredible 2x over SKL! Ryzen is the JVM king.
Java performance is even more incredible than what we've seen in .Net; server people rejoice – if you have Java workloads, Ryzen is the CPU for you! 40% better performance than Intel's 2011 platform at a much lower cost? Yes please!
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 99 [+20%] | 52.6 59.5 | 36.5 82 | 49 Oracle's JVM does not yet support native vector-to-SIMD translation like .Net's CLR, but here Ryzen manages a 20% lead over HSW-E and is two-thirds faster than SKL.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 93 [+17%] | 51 60.6 | 37.7 79 | 53 With 64-bit vectorised workload Ryzen maintains its lead of about 20%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 86 [+40%] | 42.3 40.6 | 22.1 61 | 32 Just as we’ve seen with Whetstone, Ryzen is about 40% faster than HSW-E and over 2x faster than SKL! It does not get a lot better than this.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 82 [+30%] | 42 40.9 | 22.1 63 | 32 With the FP64 workload Ryzen's lead somewhat inexplicably drops to 'just' 30%, but it remains over 2x faster than SKL. Nothing to grumble about really.
Java's lack of vector primitives that would let the JVM use SIMD instruction sets (i.e. SSE2, AVX/FMA) gives Ryzen free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance; Intel had better hope Oracle adds vector primitives allowing SIMD code to use the power of its CPUs' SIMD units.

Ryzen absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than on Intel’s (ex)-top-of-the-range Haswell-E – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on Ryzen. For .Net and Java code, Ryzen is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

What a reversal of fortune for AMD! Despite a hurried launch and inevitable issues which will be fixed in time (e.g. the Windows scheduler), Ryzen puts in a strong performance, grinding Intel's previous top-of-the-range Skylake 6700K and Haswell-E 5820K into the dust in most tests – at a much cheaper price.

Of course there are setbacks: highly vectorised AVX2/FMA code greatly favours Intel's SIMD units, and here Ryzen falls behind a bit; streaming algorithms can overload the two memory channels – but then Intel's mainstream platform also has only two. Still, if you were replacing a 4-channel socket 2011 platform with Ryzen, very high-speed memory may be required to sustain performance.

Its dual-CCX design may also affect non-symmetrical workloads where different threads execute different code, as thread data-sharing across CCXes is naturally slower. Clever thread assignment to the 'right' CCX should fix such issues (see the sketch below), but that is down to each application – and something Windows (or other OSes) may not be able to fix on their own. Considering SMP and NUMA systems have long existed, it is not a new problem – just one not usually seen on normal desktops due to the high cost of SMP/NUMA systems.
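A hedged sketch of that 'clever thread assignment' on Windows: pinning the calling thread to one CCX so communicating threads share its L3. The 0-7/8-15 logical-processor-to-CCX mapping below is an assumption for illustration only; production code should query the actual topology (e.g. via GetLogicalProcessorInformationEx):

```cpp
#include <windows.h>

// Pin the calling thread to one CCX of a Ryzen 1700X (SMT enabled).
// ASSUMPTION for illustration: logical processors 0-7 = CCX0, 8-15 = CCX1.
bool pin_current_thread_to_ccx(int ccx) {
    DWORD_PTR mask = (DWORD_PTR)0xFFu << (ccx * 8);  // 8 logical CPUs per CCX
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
```

Threads that share read/write data would call this with the same CCX index, keeping their traffic inside one L3 instead of crossing the fabric between CCXes.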

All in all Ryzen is a solid CPU that should worry Intel at the high end; we shall have to see how the lower-end 4-core and even 2-core versions perform.