Intel Core i9 (SKL-X) Review & Benchmarks – 4-channel @ 3200Mt/s Cache & Memory Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of desktop/mobile Skylake CPU – the 6-th gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU but what does contain is more cores, more PCIe lanes and more memory channels (up to 6 64-bit) for huge memory bandwidth.

While it may seem an “old core”, the 7-th gen Kabylake core is not much more than a stepping update with even the future 8-th gen Coffeelake rumored again to use the very same core. But what it does do is include the much expected 512-bit AVX512 instruction set (ISA) that are are not enabled in the current desktop/mobile parts.

SKL-X does not only support DDR4 but also NVM-DIMMs (non-volatile memory DIMMs) and PMem (Persistent Memory) that should revolutionise future computing with no need for memory refresh or immediate sleep/resume (no need to save/restore memory from storage).

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
TLB 4kB pages
64 4-way / 64 8-way
1536 8-way
64 full-way
1536 8-way
64 4-way / 64 8-way
1536 6-way
64 4-way
1024 8-way
Ryzen has comparatively ‘better’ TLBs than all Intel CPUs.
TLB 2MB pages
8 full-way
1536 2-way
64 full-way
1536 2-way
8 full-way
1536 6-way
8 full-way
1024 8-way
Again Ryzen has ‘better’ TLBs than all Intel versions
Memory Controller Speed (MHz) 800-3300 600-1200 800-4000 1200-4000 Intel’s UNC clock runs higher than Ryzen
Memory Speed (Mhz) Max
3200 / 2667 2400 / 2667 2533 /2667 2133 / 2133 SKL-X can officially go as high as Ryzen and normal SKL @ 2667 but runs happily at 3200Mt/s.
Memory Channels / Width
4 / 256-bit (max 8 / 384-bit) 2 / 128-bit 2 / 128-bit 4 / 256-bit SKL-X has 2 memory controllers each with up to 3 channels each for massive memory bandwidth.
Memory Timing (clocks)
16-18-18-36 6-54-19-4 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-15-15-36 4-51-16-3 2T SKL-X can run as tight timings as normal SKL or Ryzen.

Core Topology and Testing

Intel has dropped the (dual) ring bus(es) and instead opted for a mesh inter-connect between cores; on desktop parts this should not cause latency differences between cores (as with Ryzen) but on high-end server parts with many cores (up to 28) this may not be the case. The much increased L2 cache (1MB vs. old 256kB) should alleviate this issue – though the L3 cache seems to have been reduced quite a bit.

Native Performance

We are testing bandwidth and latency performance using all the available SIMD instruction sets (AVX, AVX2/FMA, AVX512) supported by the CPUs.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 87 [+85%] 47.7 39 46 With 10 cores SKL-X has massive aggregated inter-core bandwidth, almost 2x Ryzen or HSW-E.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 19 [+46%] 13 16 17 In worst-case pairs  SKL-X does well but not far away from normal SKL or HSW-E.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.2 15.7 16 13.4 [-12%]
Within the same core all modern CPUs seem to have about 15-16ns latency.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 80 45 [-43%] 49 58 Surprisingly we see massive latency increase almost 2x Ryzen or SKL.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 Naturally Ryzen scores worst when going off-CCX.
It seems the mesh inter-connect between cores has decent bandwidth but much higher latency than the older HSW-E or even the current SKL.
Aggregated L1D Bandwidth (GB/s) 2200 [+3x] 727 878 1150 SKL-X has 512-bit data ports thus massive L1D bandwidth over 2x HSW-E and 3x over Ryzen.
Aggregated L2 Bandwidth (GB/s) 1010 [+81%] 557 402 500 The large L2 caches also have 2x more bandwidth than either HSW-E or Ryzen.
Aggregated L3 Bandwidth (GB/s) 289 392 [+35%] 247 205 The 2 Ryzen L3 caches have higher bandwidth than all Intel CPUs.
Aggregated Memory (GB/s) 69.3 [+2.4x] 28.5 31 42.5 With its 4 channels SKL-X reigns supreme with almost 2.5x more bandwidth than Ryzen.
The widened ports on the L1 and L2 caches allow SKL-X to demolish the competition with over 2x more bandwidth than either Ryzen or older HSW-E; only the smaller L3 cache falters. Its 4 channels running at 3200Mt/s yield huge memory bandwidth that greatly help streaming algorithms. SKL-X is a monster – we can only speculate what the server 6-channel version would score.
Data In-Page Random Latency (ns) 26 [1/2.84x] (4-13-33) 74 (4-17-36) 20 (4-12-21) 25 (4-12-26) SKL-X has comparable lantecy with SKL and HSW-E and much better than Ryzen.
Data Full Random Latency (ns) 75 [-21%] (4-13-70) 95 (4-17-37) 65 (4-12-34) 72 (4-13-52) Full random latencies are a bit higher than expected but on part with HSW-E and better than Ryzen.
Data Sequential Latency (ns) 5.4 [+28%] (4-11-13) 4.2 (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Strangely SKL-X does not do as well as SKL here or Ryzen but at least it beats HSW-E.
If you were hoping SKL-E to match normal SKL that is sadly not the case even at similar Turbo clock they are higher across the board, even allowing Ryzen a win. Perhaps further platform optimisations are needed.
Code In-Page Random Latency (ns) 12 [-27%] (4-14-28) 16.6 (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code SKL-X performs better though not enough to catch normal SKL.
Code Full Random Latency (ns) 86 [-15%] (4-16-106) 102 (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page code latency takes a bigger hit but nothing to worry about.
Code Sequential Latency (ns) 6.5 [-27%] (4-7-12) 8.9 (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Again nothing much changes here.
SKL-X again does not manage to match normal SKL but soundly trounces both Ryzen and its older HSW-E brother, delivering a good result overall. Code access seems to perform more consistently than data for some reason we need to investigate.
Memory Update Transactional (MTPS) 52.2 [+12x] HLE 4.23 32.4 HLE 7 SKL-X with working HLE is over 12-times faster than Ryzen and older HSW-E.
Memory Update Record Only (MTPS) 57.2 [+13.6x] HLE 4.19 25.4 HLE 5.47 SKL-X is king of the hill with nothing getting close.
Yes – Intel has finally fixed HLE/RTL which owners of HSW-E and BRW-E must feel very hard done by considering it was “working” before having it disabled due to the errata. Thus after so many years we have both HLE, RTL and AVX512! Great!

If there was any doubt, SKL-X does not disappoint – massive cache (L1D and L2) aggregate and memory bandwidths with server versions likely even more; the smaller L3 cache does falter though which is a bit of a surprise – the larger L2 caches must have forced some compromises to be made.

Latency is a bit disappointing compared to the “normal” SKL/KBL we have on desktop, but are still better than older HSW-E and also Ryzen competitor. Again the L1 and L2 caches (despite being 4-times bigger) clock latencies are OK with the L3 and memory controller being the source of the increased latencies.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

After a strong CPU performance we did not expect the cache and memory performance to disappoint – and it does not. SKL-X is a big improvement over the older versions (HSW-E) and competition with few weaknesses.

The mesh interconnect does seem to exhibit higher inter-core latencies with small increase in bandwidth; perhaps this can be fixed.

The very much reduced L3 cache does disappoint both bandwidth and latency wise; the memory controllers provide huge bandwidth but at the expense of higher latencies.

All in all, if you can afford it, there is no question that SKL-X is worth it. But better wait to see what AMD’s Threadripper has in store before making your choice… 😉

AMD Ryzen Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017) replacing the previous “Vishera”/”Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later and likely include a few improvements as well.

This is the “make or break” CPU for AMD and thus greatly improve performance, including much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). Thus there are no longer “core modules” but proper “cores with 2 SMT threads” so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

The desktop version uses the AM4 socket and requires new boards based on various south-bridges with names similar to Intel’s aka X370 (performance), B350 (mainstream), etc.

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Ryzen 1700X
Intel 6700K (Skylake)
Intel 5820K (Haswell-E) Comments
TLB 4kB pages
64 full-way

1536 8-way

64 8-way

1536 6-way

64 4-way

1024 8-way

Ryzen has comparatively ‘better’ TLBs than even SKL while the 2-generation older HSW-E is showing its age.
TLB 2MB pages
64 full-way

1536 2-way

8 full-way

1536 6-way

8 full-way

1024 8-way

Nothing much changes for 2MB pages with Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-4000 1200-4000 Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (Mhz) Max
2400 / 2666 2533 / 2400 2133 / 2133 Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL supports only up to 2400 officially but happily runs at 2533MHz; old HSW-E can only do 2133MHz but with 4 memory channels.
Memory Channels / Width
2 / 128-bit 2 / 128-bit 4 / 256-bit HSW-E leads with 4 memory channels of DDR4 providing massive bandwidth for its cores; however both Ryzen and Skylake can use faster DDR4 memory reducing this problem somewhat.
Memory Timing (clocks)
14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-15-15-36 4-51-16-3 2T Despite faster memory Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth which we will test with presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 47.7 [+3%] 39 46 With 8 cores (and thus 8 pairs) Ryzen’s bandwidth matches the 6-core HSW-E and is 22% higher than SKL thus decent.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 13 [-23%] 16 17 In worst-case pairs on Ryzen must go across CCXes while on Intel CPUs they can still use L3 cache to exchange data – thus it ends up about 23% slower than both SKL and HSW-E but it is not catastrophic.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.7 [-2%] 16 13.4 Within the same core (sharing L1D/L2) , Ryzen inter-unit is ~15ns comparative with both Intel’s CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 45 [-8%] 49 58 Within the same compute unit (sharing L3), the latency is ~45ns a bit lower than either SKL and much lower than HSW-E thus so far so good!
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 [+3x] Going inter-CCX increases the latency by 3 times to about 130ns thus if the threads are not properly scheduled you can end up with far higher latency and far less bandwidth.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled to avoid inter-CCX transfers where possible. As the CCX link runs at UMC speed using faster memory increases link bandwidth and decreases its latency which helps no end.
Aggregated L1D Bandwidth (GB/s) 727 [-17%] 878 1150 SKL has 512-bit data ports so Ryzen cannot compete with that but it does well against BRW-E.
Aggregated L2 Bandwidth (GB/s) 557 [+38%] 402 500 The 8 L2 caches have competitive bandwidth thus overall Ryzen has though BRW-E does well.
Aggregated L3 Bandwidth (GB/s) 392 [+58%] 247 205 Even spread over the 2 CCXes the L3 caches have huge aggregated bandwidth  – over over SKL.
Aggregated Memory (GB/s) 28.5 [-8%] 31 42.5 Running at lower memory speed Ryzen cannot beat SKL nor BRW-E with its 4 memory controllers but has higher comparative efficiency.
The 8x L2 caches and 2x L3 caches have much higher aggregated latency than either Intel CPU while the memory controller is also more efficient though it cannot compete with 4-channel BRW-E. But its 8x L1D caches are not “wide enough” to compete with SKL’s widened data ports (again widened in HSW). This may be one reason SIMD performance is not as high with Ryzen and AMD may have to widen them going forward especially when adding AVX512 support.
Data In-Page Random Latency (ns) 74 [+2.9x] (4-17-36) 20 (4-12-21) 25.3 (4-12-26) In-page latency is surprisingly large, almost 3x old SNB-E and ~4x SKL! Ryzen’s TLBs seem slow. L1 and L2 latencies are comparative (4 and 17 clk) but L3 latency is already ~50% higher than HSW and almost 2x SKL.
Data Full Random Latency (ns) 95 [+31%] (4-17-37) 65 (4-12-34) 72 (4-13-52) Out-of-page latencies are ‘better’ with Ryzen ‘only’ ~30% slower than HSW-E and about 50% slower than SKL. Again L1 is and L2 are fine (4 and 17 clk) and L3 is comparative to SKL (37 vs 34 clk) while old HSW-E trails (52 clk)!
Data Sequential Latency (ns) 4.2 [+1%] (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency at ~4ns matching SKL and much better than old HSW-E.
We finally discover an issue – Ryzen’s memory latencies (in-page) are very high compared to the competition – TLB issue? Fortunately sequential and out-of-page performance is fine so perhaps its memory prefetchers can alleviate the problem somewhat but it is something that will need to be addressed
Code In-Page Random Latency (ns) 16.6 [+5%] (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code we don’t see the same problem – with in-page latency matching HSW-E, though still about 50% higher than SKL. The twice as large (64kB) L1I cache seems to have the same (4ckl) latency as SKL’s. No issues with L2 nor L3 latencies either.
Code Full Random Latency (ns) 102 [+20%] (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page latency is a bit higher than both Intel CPUs, ~50% higher than SKL but nothing as bad as we’ve seen with data.
Code Sequential Latency (ns) 8.9 [-12%] (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns comparative to HSW-E but again about ~67% higher than SKL.
While code access latencies are higher than the new SKL – they are comparative with the older HSW-E and nowhere near as bad as that we’ve seen with data. Even the twice as large L1I (L1 instruction cache) behaves itself with 4clk latency similar to Intel’s L1I smaller versions. It is thus a mystery with data is affected but not code.
Memory Update Transactional (MTPS) 4.23 [-39%] 32.4 HLE 7 SKL is in a World of its own due to support for HLE/RTM but Ryzen is still about 40% slower than HSW-E with just 6 cores.
Memory Update Record Only (MTPS) 4.19 [-23%] 25.4 HLE 5.47 With only record updates the difference drops to about 20% but again HLE shows its power for transaction processing.
Without HLE/RTM Ryzen is not going to win against Intel’s latest – but then again HLE/RTM are disabled in all but the very top-end CPUs – not to mention killed in previous HSW and BRW architectures so it is not a big problem. But if future models were to enable it, Intel will have a big problem on its hands…

Ryzen’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals partly due to more cores and more caches (8 vs 6 or 4); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies which are far higher – about 3 times – TLB issues? We’ve been here before with Bulldozer which could not be easily fixed – but if AMD does manage it this time Ryzen’s performance will literally fly!

Still, despite this issue we’ve seen in the previous article that Ryzen’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen’s memory performance is not the clean-sweep we’ve seen in CPU testing but it is competitive with Intel’s designs, especially the older HSW (and thus BRW) cores while the newer SKL (and thus KBL) cores sporting improved caches and TLBs which are hard for Ryzen to beat. Still it’s nothing to be worried about and perhaps AMD will be able improve things further with microcode/firmware updates if not new steppings and models (e.g. the APU model 10).

Overall we’d still recommend Ryzen over Intel CPUs unless you want absolutely tried and tested design which have already been patched by microcode and firmware/BIOS updates. The platform has a bright future with more CPUs destined to use the AM4 socket while both 1551 (SKL/KBL) and 2011 (HSW-E/BRW-E) platforms due to be replaced again with no future upgrades.