Intel Core i9 (SKL-X) Review & Benchmarks – 4-channel @ 3200Mt/s Cache & Memory Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of the desktop/mobile Skylake CPU – the 6th gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU, but what it does contain is more cores, more PCIe lanes and more memory channels (up to six 64-bit channels) for huge memory bandwidth.

While it may seem an “old core”, the 7th gen Kabylake core is not much more than a stepping update, with even the future 8th gen Coffeelake rumoured to use the very same core again. What SKL-X does add is the long-awaited 512-bit AVX512 instruction set (ISA), which is not enabled in the current desktop/mobile parts.

SKL-X supports not only DDR4 but also NVM-DIMMs (non-volatile memory DIMMs) and PMem (Persistent Memory), which should revolutionise future computing: no memory refresh is needed and sleep/resume becomes immediate (no need to save/restore memory from storage).

In this article we test CPU cache and memory performance.

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications | Intel i9 7900X (Skylake-X) | AMD Ryzen 1700X | Intel i7 6700K (Skylake) | Intel i7 5820K (Haswell-E) | Comments
TLB 4kB pages | 64 4-way / 64 8-way; 1536 8-way | 64 full-way; 1536 8-way | 64 4-way / 64 8-way; 1536 6-way | 64 4-way; 1024 8-way | Ryzen has comparatively ‘better’ TLBs than all Intel CPUs.
TLB 2MB pages | 8 full-way; 1536 2-way | 64 full-way; 1536 2-way | 8 full-way; 1536 6-way | 8 full-way; 1024 8-way | Again Ryzen has ‘better’ TLBs than all Intel versions.
Memory Controller Speed (MHz) | 800-3300 | 600-1200 | 800-4000 | 1200-4000 | Intel’s UNC clock runs higher than Ryzen’s.
Memory Speed (MHz) Max | 3200 / 2667 | 2400 / 2667 | 2533 / 2667 | 2133 / 2133 | SKL-X can officially go as high as Ryzen and normal SKL @ 2667 but runs happily at 3200Mt/s.
Memory Channels / Width | 4 / 256-bit (max 6 / 384-bit) | 2 / 128-bit | 2 / 128-bit | 4 / 256-bit | SKL-X has 2 memory controllers, each with up to 3 channels, for massive memory bandwidth.
Memory Timing (clocks) | 16-18-18-36 6-54-19-4 2T | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | 14-15-15-36 4-51-16-3 2T | SKL-X can run timings as tight as normal SKL or Ryzen.
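
To put the TLB figures above in perspective, a rough “TLB reach” (entries multiplied by page size) can be worked out. The sketch below is only an illustration: the helper name `tlb_reach_bytes` is ours, we assume the larger entry count listed for each CPU is its second-level TLB, and the calculation ignores associativity conflicts.

```python
# Rough TLB reach: how much memory a TLB can map before a miss is forced.
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered by a TLB holding `entries` translations of `page_size` bytes."""
    return entries * page_size

# Larger entry count per CPU for 4kB pages, taken from the table above
# (assumed here to be the second-level TLB).
l2_tlb_entries = {
    "i9 7900X": 1536,
    "Ryzen 1700X": 1536,
    "i7 6700K": 1536,
    "i7 5820K": 1024,
}

reach_4k_mib = {cpu: tlb_reach_bytes(n, 4 * KIB) // MIB for cpu, n in l2_tlb_entries.items()}

for cpu, reach in reach_4k_mib.items():
    print(f"{cpu}: about {reach} MiB reachable with 4kB pages")
```

With 2MB pages the same entry counts cover gigabytes rather than megabytes, which is why large pages matter for big working sets.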

Core Topology and Testing

Intel has dropped the (dual) ring bus(es) and instead opted for a mesh inter-connect between cores; on desktop parts this should not cause latency differences between cores (as with Ryzen) but on high-end server parts with many cores (up to 28) this may not be the case. The much increased L2 cache (1MB vs. old 256kB) should alleviate this issue – though the L3 cache seems to have been reduced quite a bit.

Native Performance

We are testing bandwidth and latency performance using all the available SIMD instruction sets (AVX, AVX2/FMA, AVX512) supported by the CPUs.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance; for latency results (ns) lower values are better.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 87 [+85%] 47.7 39 46 With 10 cores SKL-X has massive aggregated inter-core bandwidth, almost 2x Ryzen or HSW-E.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 19 [+46%] 13 16 17 In worst-case pairs SKL-X does well but is not far ahead of normal SKL or HSW-E.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.2 15.7 16 13.4 [-12%] Within the same core all modern CPUs seem to have about 15-16ns latency.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 80 45 [-43%] 49 58 Surprisingly we see a massive latency increase, almost 2x that of Ryzen or SKL.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 Naturally Ryzen scores worst when going off-CCX.
It seems the mesh inter-connect between cores has decent bandwidth but much higher latency than the older HSW-E or even the current SKL.
Aggregated L1D Bandwidth (GB/s) 2200 [+3x] 727 878 1150 SKL-X has 512-bit data ports thus massive L1D bandwidth over 2x HSW-E and 3x over Ryzen.
Aggregated L2 Bandwidth (GB/s) 1010 [+81%] 557 402 500 The large L2 caches also have 2x more bandwidth than either HSW-E or Ryzen.
Aggregated L3 Bandwidth (GB/s) 289 392 [+35%] 247 205 The 2 Ryzen L3 caches have higher bandwidth than all Intel CPUs.
Aggregated Memory (GB/s) 69.3 [+2.4x] 28.5 31 42.5 With its 4 channels SKL-X reigns supreme with almost 2.5x more bandwidth than Ryzen.
The widened ports on the L1 and L2 caches allow SKL-X to demolish the competition with over 2x more bandwidth than either Ryzen or older HSW-E; only the smaller L3 cache falters. Its 4 channels running at 3200Mt/s yield huge memory bandwidth that greatly help streaming algorithms. SKL-X is a monster – we can only speculate what the server 6-channel version would score.
Data In-Page Random Latency (ns) 26 [1/2.84x] (4-13-33) 74 (4-17-36) 20 (4-12-21) 25 (4-12-26) SKL-X has comparable latency to SKL and HSW-E and much better than Ryzen.
Data Full Random Latency (ns) 75 [-21%] (4-13-70) 95 (4-17-37) 65 (4-12-34) 72 (4-13-52) Full random latencies are a bit higher than expected but on par with HSW-E and better than Ryzen.
Data Sequential Latency (ns) 5.4 [+28%] (4-11-13) 4.2 (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Strangely SKL-X does not do as well as SKL or Ryzen here but at least it beats HSW-E.
If you were hoping for SKL-X to match normal SKL, that is sadly not the case: even at a similar Turbo clock its latencies are higher across the board, even allowing Ryzen a win. Perhaps further platform optimisations are needed.
Code In-Page Random Latency (ns) 12 [-27%] (4-14-28) 16.6 (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code SKL-X performs better though not enough to catch normal SKL.
Code Full Random Latency (ns) 86 [-15%] (4-16-106) 102 (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page code latency takes a bigger hit but nothing to worry about.
Code Sequential Latency (ns) 6.5 [-27%] (4-7-12) 8.9 (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Again nothing much changes here.
SKL-X again does not manage to match normal SKL but soundly trounces both Ryzen and its older HSW-E brother, delivering a good result overall. Code access seems to perform more consistently than data for some reason we need to investigate.
Memory Update Transactional (MTPS) 52.2 [+12x] HLE 4.23 32.4 HLE 7 SKL-X with working HLE is over 12-times faster than Ryzen and older HSW-E.
Memory Update Record Only (MTPS) 57.2 [+13.6x] HLE 4.19 25.4 HLE 5.47 SKL-X is king of the hill with nothing getting close.
Yes – Intel has finally fixed HLE/RTM, and owners of HSW-E and BRW-E must feel very hard done by considering it was “working” before being disabled due to the errata. Thus after so many years we have HLE, RTM and AVX512! Great!

If there was any doubt, SKL-X does not disappoint – massive cache (L1D and L2) aggregate and memory bandwidths with server versions likely even more; the smaller L3 cache does falter though which is a bit of a surprise – the larger L2 caches must have forced some compromises to be made.
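
As a sanity check on the memory figure, the theoretical peak of a DDR4 configuration is simply channels × bus width × transfer rate. The sketch below is our own illustration (the function name `peak_bandwidth_gbs` is hypothetical, and we take 1 GB as 10^9 bytes); it compares the 4-channel 3200Mt/s peak with the 69.3 GB/s measured above.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8          # a 64-bit channel moves 8 bytes per transfer
    return channels * bytes_per_transfer * mega_transfers * 1e6 / 1e9

# SKL-X as tested: 4 channels at 3200 mega-transfers per second.
peak = peak_bandwidth_gbs(channels=4, mega_transfers=3200)

measured = 69.3                                 # aggregated memory bandwidth from the table
efficiency = measured / peak

print(f"peak {peak:.1f} GB/s, measured {measured} GB/s, efficiency {efficiency:.0%}")
```

Under these assumptions the peak works out to about 102 GB/s, so the measured result is roughly two thirds of what the DIMMs could deliver in theory.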

Latency is a bit disappointing compared to the “normal” SKL/KBL we have on desktop, but is still better than the older HSW-E and also the competing Ryzen. Again the clock latencies of the L1 and L2 caches (despite the L2 being 4-times bigger) are OK, with the L3 and memory controller being the source of the increased latencies.

Final Thoughts / Conclusions

After a strong CPU performance we did not expect the cache and memory performance to disappoint – and it does not. SKL-X is a big improvement over the older versions (HSW-E) and competition with few weaknesses.

The mesh interconnect does seem to exhibit higher inter-core latencies with small increase in bandwidth; perhaps this can be fixed.

The very much reduced L3 cache does disappoint both bandwidth and latency wise; the memory controllers provide huge bandwidth but at the expense of higher latencies.

All in all, if you can afford it, there is no question that SKL-X is worth it. But better wait to see what AMD’s Threadripper has in store before making your choice… 😉

Intel Core i9 (SKL-X) Review & Benchmarks – CPU 10-core AVX512 Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of the desktop/mobile Skylake CPU – the 6th gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU, but what it does contain is more cores, more PCIe lanes and more memory channels:

  • Server 2S, 4S and 8S (sockets)
  • Workstation 1S and 2S
  • Up to 28 cores and 56 threads per CPU
  • Up to 48 PCIe 3.0 lanes
  • Up to 46-bit physical address space and 48-bit virtual address space
  • 512-bit SIMD aka AVX512F, AVX512BW, AVX512DQ
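
To make the address-space figures in the list concrete, the number of address bits converts directly to addressable bytes. This is a minimal sketch with a helper name of our own choosing (`address_space_tib`), not anything taken from Intel documentation.

```python
TIB = 2 ** 40  # one tebibyte in bytes

def address_space_tib(bits: int) -> int:
    """Size in TiB of a flat address space that is `bits` wide."""
    return 2 ** bits // TIB

physical_tib = address_space_tib(46)  # physical address space quoted in the list
virtual_tib = address_space_tib(48)   # virtual address space quoted in the list

print(f"46-bit physical: {physical_tib} TiB, 48-bit virtual: {virtual_tib} TiB")
```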

While it may seem an “old core”, the 7th gen Kabylake core is not much more than a stepping update, with even the future 8th gen Coffeelake rumoured to use the very same core again. What SKL-X does add is the long-awaited 512-bit AVX512 instruction set (ISA), which is not enabled in the current desktop/mobile parts.

On the desktop – Intel is now using the “i9” moniker for its top parts – in a way a much needed change for its top HEDT platform (socket 2011 now socket 2066) to differentiate from its mainstream one.

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
Cores (CU) / Threads (SP) 10C / 20T 8C / 16T 4C / 8T 6C / 12T SKL-X manages more cores than Ryzen (10 vs 8) which considering their speed may just be too tough to beat. HSW-E topped at 8 cores also.
Speed (Min / Max / Turbo) 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) 0.8-4.0-4.2GHz (8x-40x-42x) 1.2-3.3-4.0GHz (12x-33x-40x) SKL-X somehow manages higher single-core turbo than even SKL-A (42x v 43x) – but its rated speed is a match for Ryzen and HSW-E.
Power (TDP) 140W 95W 91W 140W Ryzen has comparable TDP to SKL while HSW-E and SKL-X are both almost 50% higher.
L1D / L1I Caches 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 6x 32kB 8-way / 6x 32kB 2-way Ryzen’s instruction cache is 2x the data cache, a somewhat strange decision; all caches are 8-way except the HSW-E’s L1I.
L2 Caches 10x 1MB 16-way 8x 512kB 8-way 4x 256kB 8-way 6x 256kB 8-way Surprise surprise – the new SKL-X’s L2 is 4-times the size of SKL/HSW-E and thus even beats Ryzen. Large datasets should have no problem getting cached.
L3 Caches 13.75MB 11-way 2x 8MB 16-way 8MB 16-way 15MB 20-way In a somewhat surprising move, the L3 cache has been reduced pretty drastically and is now smaller than both Ryzen and even the very old HSW-E!
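
The L2/L3 trade-off in the table is easier to see when the per-core figures are totalled. The sketch below just does that arithmetic on the table values; the dictionary layout and the `cache_summary` helper are our own illustration.

```python
# (cores, L2 per core in MB, total L3 in MB), copied from the table above.
caches = {
    "i9 7900X": (10, 1.0, 13.75),
    "Ryzen 1700X": (8, 0.5, 16.0),   # 2x 8MB L3
    "i7 6700K": (4, 0.25, 8.0),
    "i7 5820K": (6, 0.25, 15.0),
}

def cache_summary(cores: int, l2_per_core_mb: float, l3_total_mb: float) -> dict:
    """Total L2, L3 share per core and combined L2+L3 capacity, all in MB."""
    l2_total = cores * l2_per_core_mb
    return {
        "l2_total_mb": l2_total,
        "l3_per_core_mb": l3_total_mb / cores,
        "l2_plus_l3_mb": l2_total + l3_total_mb,
    }

summary = {cpu: cache_summary(*spec) for cpu, spec in caches.items()}

for cpu, s in summary.items():
    print(f"{cpu}: L2 total {s['l2_total_mb']}MB, "
          f"L3 per core {s['l3_per_core_mb']:.2f}MB, L2+L3 {s['l2_plus_l3_mb']}MB")
```

On these numbers SKL-X has the least L3 per core of the four but the most combined L2+L3, which matches the comments in the table.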


Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks i9-7900X (Skylake-X) Ryzen 1700X i7-6700K 4C/8T (Skylake) i7-5820K (Haswell-E)
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 446 [+54%] AVX2 290 AVX2 185 AVX2 233 AVX2 Dhrystone does not yet use AVX512 – but no matter SKL-X beats Ryzen by over 50%!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+57%] AVX2 292 AVX2 185 AVX2 230 AVX2 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 271 [+46%] AVX/FMA 185 AVX/FMA 109 AVX/FMA 150 AVX/FMA Whetstone does not yet use AVX512 either – but SKL-X is still approx 50% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 223 [+50%] AVX/FMA 155 AVX/FMA 89 AVX/FMA 116 AVX/FMA With FP64 the winning streak continues.
The Empire strikes back – SKL-X beats Ryzen by a sizeable difference (50%) across integer or floating-point workloads even on “legacy” AVX2/FMA instruction set. It will only get faster once AVX512 is enabled.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1460 [+2.7x] AVX512DQW 535 AVX2 513 AVX2 639 AVX2 For the 1st time we see AVX512 in action and everything is pummeled into dust – almost 3-times faster than Ryzen!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 521 [+3.3x] AVX512DQW 159 AVX2 191 AVX2 191 AVX2 With a 64-bit integer vectorised workload SKL-X is over 3-times faster than Ryzen!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 5.37 [+48%] 3.61 2.15 2.74 This is a tough test using Long integers to emulate Int128 without SIMD and thus SKL-X returns to “just” 50% faster than Ryzen.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1800 [+3.4x] AVX512F 530 FMA 479 FMA 601 FMA In this floating-point vectorised test we again see the power of AVX512, with SKL-X again over 3-times faster than Ryzen!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 1140 [+3.8x] AVX512F 300 FMA 271 FMA 345 FMA Switching to FP64 SIMD code SKL-X gets even faster, approaching 4-times.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 24 [+84%] AVX512F 13.7 FMA 10.7 FMA 12 FMA In this heavy algorithm, which uses FP64 to mantissa-extend FP128 but is not vectorised, SKL-X returns to just 85% faster.
Ryzen’s SIMD units were never strong – splitting 256-bit ops into 2 – but with AVX512 SKL-X is unstoppable: integer or floating-point, we see it over 3-times faster, which is a serious improvement in performance. Even against its older HSW-E brother it is over 2-times faster, a significant upgrade. For heavy vectorised SIMD code – as long as it’s updated to AVX512 – there is no other choice.
BenchCrypt Crypto AES-256 (GB/s) 32.7 [+2.4x] AES 13.8 AES 15 AES 20 AES All CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and with 4 memory channels SKL-X reigns supreme – it’s over 2-times faster.
BenchCrypt Crypto AES-128 (GB/s) 32 [+2.3x] AES 13.9 AES 15 AES 20.1 AES What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to even match HSW-E, never mind SKL-X.
BenchCrypt Crypto SHA2-256 (GB/s) 25 [+46%] AVX512DQW 17.1 SHA 5.9 AVX2 7.6 AVX2 Even Ryzen’s support for SHA hardware acceleration is not enough as memory bandwidth lets it down with SKL-X “only” 50% faster through AVX512.
BenchCrypt Crypto SHA1 (GB/s) 39.3 [+2.3x] AVX512DQW 17.3 SHA 11.3 AVX2 15.1 AVX2 SKL-X only gets faster with the simpler SHA1 and is now over 2-times faster.
BenchCrypt Crypto SHA2-512 (GB/s) 21.1 [+6.3x] AVX512DQW 3.34 AVX2 4.4 AVX2 5.34 AVX2 SHA2-512 is not accelerated by SHA HWA thus Ryzen is forced to use SIMD and loses badly.
Memory bandwidth rules here and SKL-X with its 4 channels of ~100GB/s bandwidth reigns supreme (we can only imagine what the 6-channel beast will score) – so Ryzen loses badly. Its ace card – support for SHA HWA – is not enough to “save it” as AVX512 allows SKL-X to power through algorithms like a knife through butter. The 64-bit SHA2-512 test is sobering, with SKL-X no less than 6-times faster than Ryzen.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 320 [+36%] 234 129 157 In this non-vectorised test SKL-X is only 36% faster than Ryzen. SIMD would greatly help it here.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 277 [+40%] 198 108 131 Switching to FP64 code nothing much changes, SKL-X is just 40% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 66.9 [-21%] 85.1 27.2 37.8 Binomial uses thread shared data thus stresses the cache & memory system; somehow Ryzen manages to win this.
BenchFinance Binomial double/FP64 (kOPT/s) 65 [+41%] 45.8 25.5 33.3 With FP64 code the situation gets back to “normal” – with SKL-X again 40% faster than Ryzen.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 64 [+30%] 49.2 25.9 31.6 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; SKL-X is just 30% faster here.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 51 [+36%] 37.3 19.1 21.2 Switching to FP64 where Ryzen did so well – SKL-X returns to 40% faster.
Without the help of its SIMD engine, SKL-X is still 30-40% faster than Ryzen but over 2-times faster than HSW-E, showing just how much the core has improved for complex code with lots of shared data (read-only or modifiable). While Ryzen thought it found its “niche” it has been already beaten…
BenchScience SGEMM (GFLOPS) float/FP32 343 [5x] FMA 68.3 FMA 109 FMA 185 FMA GEMM has not yet been updated for AVX512 but SKL-X is an incredible 5x faster!
BenchScience DGEMM (GFLOPS) double/FP64 124 [+2x] FMA 62.7 FMA 72 FMA 87.7 FMA Even without AVX512, with FP64 vectorised code, SKL-X still manages 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 34 [+3.8x] FMA 8.9 FMA 18.9 FMA 18 FMA FFT has also not been updated to AVX512 but SKL-X is still 4x faster than Ryzen!
BenchScience DFFT (GFLOPS) double/FP64 19 [+2.5x] FMA 7.5 FMA 9.3 FMA 10.9 FMA With FP64 SIMD SKL-X is over 2.5x faster than Ryzen in this tough algorithm with loads of memory accesses.
BenchScience SNBODY (GFLOPS) float/FP32 585 [+2.5x] FMA 234 FMA 273 FMA 158 FMA NBODY is not yet updated to AVX512 but again SKL-X wins.
BenchScience DNBODY (GFLOPS) double/FP64 179 [+2x] FMA 87 FMA 79 FMA 40 FMA With FP64 code SKL-X is still 2-times faster than Ryzen.
With highly vectorised SIMD code, even without the help of AVX512, SKL-X is over 2.5x faster than Ryzen, but more than that – almost 4-times faster than its older HSW-E brother!
CPU Image Processing Blur (3×3) Filter (MPix/s) 1639 [+2.2x] AVX2 750 AVX2 655 AVX2 760 AVX2 In this vectorised integer AVX2 workload SKL-X is over 2x faster than Ryzen.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 711 [+2.2x] AVX2 316 AVX2 285 AVX2 345 AVX2 Same algorithm but more shared data does not change anything.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 377 [+2.2x] AVX2 172 AVX2 151 AVX2 188 AVX2 Again same algorithm but even more data shared does not change anything again.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 609 [+2.1x] AVX2 292 AVX2 271 AVX2 316 AVX2 Different algorithm but SKL-X is still 2x faster than Ryzen.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 79.8 [+36%] AVX2 58.5 AVX2 35.4 AVX2 50.3 AVX2 Still AVX2 vectorised code but here Ryzen does much better, with SKL-X just 36% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 15.7 [+63%] 9.6 6.3 7.6 This test is not vectorised though it uses SIMD instructions and here SKL-X only manages to be 63% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1000 [+17%] 852 422 571 Again in a non-vectorised test Ryzen just flies but SKL-X manages to be 20% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 190 [+29%] 147 75 101 In this final non-vectorised test Ryzen really flies but not enough to beat SKL-X which is 30% faster.
As with other SIMD tests, SKL-X remains just over 2-times faster than Ryzen and about as fast over HSW-E. But without SIMD it drops significantly to just 20-60% showing just how good Ryzen performs.

When using the new AVX512 instruction set – we see incredible performance with SKL-X about 3x faster than its Ryzen competitor and about 2x faster than the older HSW-E; with the older AVX2/FMA instruction sets supported by all CPUs, it is “only” about 2x faster. When using non-vectorised SIMD code its lead shortens to about 30-60%.
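
The bracketed figures in the tables appear to be the SKL-X score relative to Ryzen. A small helper makes that arithmetic explicit; this is our own sketch (the name `relative_label` and the 2x switch-over point are assumptions, and the tables themselves round a little less consistently).

```python
def relative_label(score: float, baseline: float) -> str:
    """Format `score` against `baseline`: a percentage below 2x, a multiple from 2x up."""
    ratio = score / baseline
    if ratio >= 2.0:
        return f"+{ratio:.1f}x"
    return f"{(ratio - 1.0) * 100:+.0f}%"

# SKL-X vs. Ryzen scores taken from the native benchmark table.
examples = {
    "Dhrystone Integer (GIPS)": (446, 290),
    "Int32 Multi-Media (Mpix/s)": (1460, 535),
    "Binomial FP32 (kOPT/s)": (66.9, 85.1),
}

labels = {name: relative_label(skl_x, ryzen) for name, (skl_x, ryzen) in examples.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```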

While we’ve not tested memory performance in this article, we see that in streaming tests its 4 DDR4 channels trounce 2-channel CPUs that just cannot feed all their cores. Being able to use much faster DDR4 memory (3200 vs 2133) allows it to also soundly beat its older HSW-E brother.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks i9-7900X (Skylake-X) Ryzen 1700X i7-6700K 4C/8T (Skylake) i7-5820K (Haswell-E)
BenchDotNetAA .Net Dhrystone Integer (GIPS) 69.8 [+1.9x] 36.5 23.3 30.7 While Ryzen used to dominate .Net CLR workloads, now SKL-X is 2x faster than it and naturally the older HSW-E.
BenchDotNetAA .Net Dhrystone Long (GIPS) 60.9 [+35%] 45.1 23.6 28.2 Ryzen seems to do very well here cutting SKL-X’s lead to just 35% – while still being almost 2x faster than HSW-E
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 112 [+12%] 100.6 47.4 65.4 Floating-point CLR performance is pretty spectacular on Ryzen, and SKL-X only manages to be 12% faster.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 138 [+14%] 121.3 63.6 85.7 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with SKL-X just 14% faster.
While Ryzen used to dominate .Net workloads, SKL-X restores the balance in Intel’s favour – though in many tests it is just over 10% faster than Ryzen. The CLR definitely seems to prefer Ryzen.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 140 [+50%] 92.6 55.7 75.4 Just as we saw with Dhrystone, this integer workload sees a 50% improvement for SKL-X. While RyuJIT supports SIMD integer vectors, the lack of bitfield instructions makes it slower for our code; shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 143 [+47%] 97.8 60.3 79.2 With 64-bit integer workload we see a similar story – SKL-X is about 50% faster.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 543 [+2x] AVX/FMA 272.7 AVX/FMA 12.9 284.2 AVX/FMA Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code – SKL-X strikes back to 2x faster than Ryzen.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 294 [+2x] AVX/FMA 149 AVX/FMA 38.7 176.1 AVX/FMA Switching to FP64 SIMD vector code – still running AVX/FMA – SKL-X is still 2x faster.
With RyuJIT’s support for SIMD vector instructions – SKL-X brings its power to bear, being the usual 2-times faster than Ryzen; it does not seem that RyuJIT supports AVX512 yet – something that will make it even faster. With scalar instructions SKL-X is “only” 50% faster but still about 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Integer (GIPS) 716 [+39%] 513 313 395 Ryzen puts a strong performance with SKL-X “just” 40% faster. Still it’s almost 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Long (GIPS) 873 [+70%] 514 332 399 Somehow SKL-X does better here with 70% faster than Ryzen.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 155 [+32%] 117 62.8 89 With a floating-point workload Ryzen continues to do well so SKL-X is again “just” 30% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 160 [+25%] 128 64.6 91 With FP64 workload SKL-X’s lead drops to 25%.
With the JVM seemingly favouring Ryzen – and without SIMD – SKL-X is just 25-40% faster than it – but do note it absolutely trounces its older HSW-E brother – being almost 2x faster. So Intel has made big gains but at a cost.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 135 [+40%] 99 59.5 82 Oracle’s JVM does not yet support SIMD vectors so SKL-X is “just” 40% faster than Ryzen.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 132 [+41%] 93 60.6 79 With 64-bit integers nothing much changes.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 97 [+13%] 86 40.6 61 Scary times as SKL-X manages its smallest lead over Ryzen at just over 10%.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 99 [+20%] 82 40.9 63 With FP64 workload SKL-X is lucky to increase its lead to 20%.
Intel had better hope Oracle adds vector primitives, allowing SIMD code to use the power of its CPU’s SIMD units.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA, AVX512) allows the competition to creep up on SKL-X in performance but at far lower cost. This is not a good place for Intel to be in.

While Ryzen used to dominate .Net and Java benchmarks – SKL-X restores the balance in Intel’s favour – though both the CLR and JVM do seem to “favour” Ryzen for some reason. If you are running the older HSW-E then you can be sure SKL-X is over 2x faster than it throughout.

Thus current and future applications running under the CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run much better on SKL-X than on older Intel designs, but also reasonably well on Ryzen – at least when not using SIMD vector extensions, where SKL-X’s power comes to the fore.

Final Thoughts / Conclusions

Just when AMD were likely celebrating their fantastic Ryzen, Intel strikes back with a killer, though really expensive, CPU. While we’ve not seen major core advances since SandyBridge (SNB and SNB-E) and will likely not see anything new in Coffeelake (CFL) either – somehow these improvements add up to quite a lot – with SKL-X soundly beating both Ryzen and its older HSW-E brother.

We finally see AVX512 released and it does not disappoint: SKL-X increases its lead by 50% through it, but note that lower-end CPUs will execute some instructions a lot slower which is unfortunate. Using AVX512 also requires new tools – either compiler which on Windows means the brand-new Visual C++ 2017 or assemblers – and decent amount of work – thus not something most developers will do – at least until the normal desktop/mobile platforms will support it too.

All in all it is a solid upgrade – though costly – but if performance is what you’re after you can “safely” remain with Intel – you don’t need to join the “rebel camp”. But we’ll need to see what AMD’s Threadripper has in store for us… 😉

Future performance with AVX512 in Sandra 2016 SP1

Intel Skylake

What is AVX512?

AVX512 is a new SIMD instruction set operating on 512-bit registers that is the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel’s “Phi” co-processor (Intel’s answer to GPGPUs) and now a version of it is making its way to CPUs themselves.

Why is AVX512 important?

CPU performance has only marginally increased (5-10%) from one generation to the next, with power efficiency being the primary goal; with limited options (cannot increase clock speeds, must reduce power, hard to improve execution efficiency, etc.) exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.

SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
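
What “processing more data simultaneously” means in practice is that each wider register holds more elements (lanes) per instruction. The sketch below tabulates lanes per register for the widths mentioned above; the names `REGISTER_BITS` and `lanes` are ours, and not every ISA actually offers every element type (MMX, for instance, is integer-only).

```python
# Register widths named in the text, in bits.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX/AVX2": 256, "AVX512": 512}
ELEMENT_BITS = (8, 16, 32, 64)

def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of `element_bits` that fit in one register."""
    return register_bits // element_bits

lanes_by_isa = {
    isa: {eb: lanes(bits, eb) for eb in ELEMENT_BITS}
    for isa, bits in REGISTER_BITS.items()
}

for isa, row in lanes_by_isa.items():
    cells = ", ".join(f"{n} x {eb}-bit" for eb, n in row.items())
    print(f"{isa}: {cells}")
```

So a single AVX512 instruction can operate on 16 FP32 or 8 FP64 values, twice what AVX/FMA manages.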

Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled) but developers are generally used to this these days after the SSE to AVX transition.

SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will need to wait until Microsoft updates its Visual C++ compiler at some point.

What CPUs will support AVX512?

It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.

AVX512 is actually a set of multiple sets – with “Skylake-E” supporting F (foundation) and CD (conflict detection), BW (byte & word), DQ (double-word and quad-word) and VL (vector length extension) – and future “Cannonlake-E” supporting IFMA (integer FMA), VBMI (vector byte manipulation) and perhaps others.

It is disappointing that AVX512 is not enabled on consumer CPUs (Core) but it will eventually appear in future iterations; until then gamers/enthusiasts need to buy into the “extreme/Skylake-E” platform, while business users will get “Xeon/Skylake-E” in their workstations.

What kind of performance improvement can we expect with AVX512?

The transition from SSE 128-bit to AVX/FMA/AVX2 256-bit has – eventually – resulted in 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best improvement. Note that AVX executes at a lower clock than “normal”/SSE code.

AVX512 not only doubles width (512-bit) but also number of registers (32 vs 16) thus we can hold 4x (four times) more data which may reduce cache/memory accesses by caching more data locally. But AVX512 code will again run at lower clock versus AVX/FMA.
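
The “4x more data” figure follows directly from the register counts and widths quoted above. A minimal sketch of that arithmetic, with a helper name of our own (`register_file_bytes`), counting architectural registers only:

```python
def register_file_bytes(registers: int, width_bits: int) -> int:
    """Bytes of data that fit in the architectural SIMD registers."""
    return registers * width_bits // 8

avx2_bytes = register_file_bytes(registers=16, width_bits=256)    # 16 registers of 256 bits
avx512_bytes = register_file_bytes(registers=32, width_bits=512)  # 32 registers of 512 bits
ratio = avx512_bytes // avx2_bytes

print(f"AVX/AVX2: {avx2_bytes} bytes, AVX512: {avx512_bytes} bytes, ratio {ratio}x")
```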

In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.

Can I test AVX512 performance with Sandra?

Yes, with the release of Sandra 2016 SP1 – you can now test AVX512 performance – naturally you need the required CPU. All the low-level benchmarks (below) have been ported to AVX512:

  • Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
  • Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
  • Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now

The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:

  • Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
  • Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
  • Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
  • .Net Vectorised (Fractal Generation): AVX512 support dependent on RyuJIT numerics libraries that need to be updated by Microsoft. No changes required.

Hardware Stats

We are comparing two released public CPUs with their projected next-gen counterparts supporting AVX512.

Processor Intel i7-6700K (Skylake) Intel i7-77XX? (next-gen) Intel i7-5820K (Haswell-E) Intel i7-78XX? (Skylake-E)
Cores/Threads 4C / 8T 4C / 8T 6C / 12T 6C / 12T
Clock Speeds (MHz) Min-Max-Turbo 800-4000-4200 assumed same 1200-3300-3600 assumed same
Caches L1/L2/L3 4x 32kB, 4x 256kB, 8MB assumed same 6x 32kB, 6x 256kB, 15MB assumed same
Power TDP Rating (W) 91W assumed same 140W assumed same
Instruction Set Support AVX2, FMA3, AVX, etc. AVX512 + AVX2, FMA3, AVX, etc. AVX2, FMA3, AVX, etc. AVX512 + AVX2, FMA3, AVX, etc.

We do not expect major changes in future AVX512 supporting arch, especially with Skylake-E as Core Skylake is already out and the core specifications are known.

Multi-media (Fractal Generation) Benchmark

Benchmark Future Core-i7 (4C/8T AVX512) Projected Core i7-6700K (4C/8T AVX2/FMA) Core i7-6700K (4C/8T SSEx) Future Core i7-E (6C/12T AVX512) Projected Core i7-5820K (6C/12T AVX2/FMA) Core i7-5820K (6C/12T SSEx)
 AVX512 Multi-Media
Integer SIMD (Mpix/s) 912.5 [+76% over AVX] 516.2 [+76% over SSE] 292 1020.7 [+76% over AVX] 577.4 [+76% over SSE] 327
We see around 76% improvement from AVX2 vs. SSE, thus we assume we’ll see something similar moving to AVX512 (~80%).
Long SIMD (Mpix/s) 315.3 [+66% over AVX] 190.1 [+66% over SSE] 114.6 284.3 [+66% over AVX] 171.4 [+66% over SSE] 87.6
We see around 66% improvement from AVX2 vs. SSE, but due to the new instructions we may see better AVX512 gains.
Single Float SIMD (Mpix/s) 916.8 [+2x over AVX] 458.4 [+2.12x over SSE] 216 1079 [+2x over AVX] 539.5 [+2.12x over SSE] 234.8
We saw over 2x improvement from AVX/FMA over SSE so while we may not see such a large improvement with AVX512, we may still get 100%.
Double Float SIMD (Mpix/s) 545.8 [+2x over AVX] 272.9 [+2.35x over SSE] 116.1 622.4 [+2x over AVX] 311.2 [+2.35x over SSE] 126
We see even better improvement from AVX over SSE here (2.35x) so hopefully we’ll get 2x moving to AVX512.
Quad Float SIMD (Mpix/s) 20.3 [+94% over AVX] 10.5 [+94% over SSE] 5.4 622.4 [+94% over AVX] 311.2 [+94% over SSE] 126
Emulating fp128 is hard work but even then AVX is 94% faster than SSE and thus we’d expect AVX512 to be almost 2x faster still.
Despite some being disappointed by arch-to-arch performance improvement, the Skylake 4C (i7-6700K) already goes toe-to-toe with Haswell-E 6C (i7-5820K), but with AVX512 support Skylake-E 6C/8C is projected to comprehensively outperform it.
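
The projected AVX512 scores in the Skylake columns above can be reproduced by re-applying the measured AVX-over-SSE gain to the AVX score, limited to 2x. The sketch below is our reading of the table rather than a published formula; the function name `project_avx512` and the `cap` parameter are assumptions.

```python
def project_avx512(avx_score: float, sse_score: float, cap: float = 2.0) -> float:
    """Project an AVX512 score by re-applying the AVX-over-SSE gain, capped at `cap`x."""
    gain = min(avx_score / sse_score, cap)
    return avx_score * gain

# Measured i7-6700K scores (Mpix/s) from the table above: (AVX2/FMA, SSE).
integer_projection = project_avx512(516.2, 292.0)   # gain ~1.77x, below the cap
long_projection = project_avx512(190.1, 114.6)      # gain ~1.66x, below the cap
float_projection = project_avx512(458.4, 216.0)     # gain ~2.12x, limited to 2x

print(f"Integer SIMD: {integer_projection:.1f} Mpix/s")
print(f"Long SIMD: {long_projection:.1f} Mpix/s")
print(f"Single Float SIMD: {float_projection:.1f} Mpix/s")
```

The cap reflects the fact that doubling the register width cannot by itself deliver more than a 2x gain, which is why the float rows are listed as “+2x over AVX” even though AVX/FMA gained more than that over SSE.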

AVX512 will also allow Skylake-E to narrow the gap between it and current GPGPUs with multi-CPU Xeon systems able to “do without” GPGPUs – well except perhaps a “Phi” or two?

 AVX512 Crypto
Hashing SHA2-256 (GB/s) 11.80 [+2x over AVX] 5.90 [+2.36x over SSE] 2.50 13.60 [+2x over AVX] 6.80 [+2.26x over SSE] 3
We see a large 2.26-2.36x improvement of AVX2 vs. SSE, thus we expect about 2x increase with AVX512 still.
Hashing SHA1 (GB/s) 23 [+2x over AVX] 11.5 [+2.16x over SSE] 5.33 27.70 [+2x over AVX] 13.85 [+2.04x over SSE] 6.79
Even with SHA1 we see a good 2.04-2.16x improvement of AVX2 vs. SSE, thus AVX512 should again double performance though we may be limited by memory bandwidth.
Hashing SHA2-512 (GB/s) 8.74 [+2x over AVX] 4.37 [+2.33x over SSE] 1.87 9.60 [+2x over AVX] 4.80 [+2.20x over SSE] 2.18
Switching to 64-bit integer SHA512 we see the best improvement yet of AVX2 vs SSE (2.2-2.33x) with AVX512 likely to improve by 2x yet again.
With hashing we see even better results than even fractal generation, with AVX2 improving over 2x over SSE – and AVX512 will thus improve by at least 100% – if anything it is likely we will hit memory bandwidth limitations.
 AVX512 Memory Bandwidth
Memory Bandwidth (GB/s) ~31.30 31.30 [0%] 31.30 ~42.00 [0%] 42.30 [-1%] 42.6
Even with DDR4 the memory sub-system hasn’t changed much and despite 512-bit transfers with AVX512 there is really no performance delta in streaming data to/from memory.
L3 Bandwidth (GB/s) ~267.97 [+10%] 243.30 [+10%] 220.90 ~202.20 [+3%] 195.90 [+3%] 189.8
As we move up the cache hierarchy, the L3 already shows a 10% bandwidth improvement using AVX2/FMA vs. SSE and AVX512 improving performance further.
L2 Bandwidth (GB/s) ~392.50 [+21%] 323.30 [+21%] 266.30 ~536.81 [+20%] 444.10 [+20%] 367.4
As we expected, L2 bandwidth improves ~20% with AVX2/FMA and likely to improve further.
L1D Bandwidth (GB/s) ~1,364.25 [+50%] 909.50 [+2.11x] 429.90 ~1,536.00 [+50%] 1,024.00 [+2x] 518
Skylake has widened the data access ports (just like Haswell before it), thus 512-bit AVX512 transfers show the best improvement yet, 40-50%!
AVX512 does help take advantage of the widened data ports in Skylake and future arch, with L1D cache showing the best bandwidth improvement just like Haswell before it (with AVX2).

Memory bandwidth is still limited by DDR4 speeds, but faster modules are coming out all the time and this time their clocks are JEDEC ratified.

We will update the article with future (projected) results once more benchmarks are converted to AVX512 – once compiler support is released – but even so far we see excellent performance improvement.

Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!