AMD Threadripper Review & Benchmarks – CPU 16-core Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – can be enabled in the server (“EPYC”) designs – while current HEDT systems only use 2 – but AMD may release versions with more dies later on.

AMD Epyc/Threadripper DieIn this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Threadripper (1950X) with HEDT competition (Intel SKL-X) as well as normal desktop solutions (Ryzen, Skylake) which also serves to compare HEDT with the “normal” desktop solution.

CPU Specifications AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
Cores (CU) / Threads (SP) 16C / 32T 10C / 20T 8C / 16T 4C / 8T Just as Ryzen, TR has the most cores though Intel has just announced new SKL-X with more cores.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 0.8-4.0-4.2GHz (8x-40x-42x) SKL has the highest base clock but all CPUs have similar Turbo clocks
Power (TDP) 180W 150W 95W 91W TR has higher TDP than SKL-X just like Ryzen so may need a beefier cooling system
L1D / L1I Caches 16x 32kB 8-way / 16x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way TR and Ryzen’s instruction caches are 2x data (and SKL/X) but all caches are 8-way.
L2 Caches 16x 512kB 8-way (8MB total) 20x 1MB 16-way (20MB total) 8x 512kB 8-way (4MB total) 4x 256kB 8-way (1MB total) SKL-X has really pushed the boat out with a 1MB L2 cache that dwarfs all other CPUs.
L3 Caches 4x 8MB 16-way (32MB total) 13.75MB 11-way 2x 8MB 16-way (16MB total) 8MB 16-way TR actually has 2 sets of 2 L3 caches rather than a combined L3 cache like SKL/X.
NUMA Nodes
2x 16GB each no, unified 32GB no, unified 16GB no, unified 16GB Only TR has 2 NUMA nodes

Thread Scheduling and Windows

Threadripper’s topology (4 cores in each CCX, with 2 CCX in one node and 2 nodes) makes things even more compilcated for operating system (Windows) schedulers. Effectively we have a 2-tier NUMA SMP system where CCXes are level 1 and nodes are level 2 thus the scheduling of threads matters a lot.

Also keep in mind this is a NUMA system (2 nodes) with each node having its own memory; while for compatibility AMD recommends (and the BIOS defaults) to “UMA” (Unified) “interleaving across nodes” – for best performance the non-interleaving mode (or “interleaving across CCX”) should be used.

What all this means is that you likely need a reasonably new operating system – thus Windows 10 / Server 2016 – with a kernel that has been updated to support Ryzen/TR as Microsoft is not likely to care about old verions.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen/TR support all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 447 [-2%] 454 226 186 TR can keep up with SKL-X and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+1%] 456 236 184 An Int64 load does not change results.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 352 [+30%] 269 184 107 Finally TR soundly beats SKL-X by 30% and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 295 [+32%] 223 154 89 With a FP64 work-load the lead inceases slightly.
Unlike Ryzen which soundly dominated Skylake (albeit with 2x more cores, 8 vs. 4), Threadripper does not have the same advantage (16 vs. 10) thus it can only beat SKL-X in floating-point work-loads where it is 30% faster, still a good result.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 918 [-22%] 1180 535 527 With AVX2/FMA SKL-X is just too strong, with TR 22% slower.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 307 [-29%] 435 161 191 With Int64 AVX2 TR is almost 20% slower than SKL-X.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 7 [+30%] 5.4 3.6 2 This is a tough test using Long integers to emulate Int128 without SIMD and here TR manages to be 30 faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 996 [=] 1000 518 466 In this floating-point AVX2/FMA vectorised test  TR manages to tie with SKL-X.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 559 [-10%] 622 299 273 Switching to FP64 SIMD code, TR is now 10% slower than SKL-X.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 27 [+12%] 24 13.7 10.7 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – TR manages a 12% win.
In vectorised AVX2/FMA code we see TR lose in most tests, or tie in one – and only shine in emulation tests not using SIMD instruction sets. Intel’s SIMD units – even without AVX512 that SKL-X brings – are just too strong for TR just as we saw Ryzen struggle against normal Skylake.
BenchCrypt Crypto AES-256 (GB/s) 27.1 [-21%] 34.4  14  15 All CPUs support AES HWA – but TR/Ryzen memory is just 2400Mt/s vs 3200 that SKL-X enjoys (+33%) thus this is a good result; TR seems to use its channels pretty effectively.
BenchCrypt Crypto AES-128 (GB/s)  27.4 [-18%]  33.5  14  15 Similar to what we saw above TR is just 18% slower which is a good result; unfortunately we cannot get the memory over 2400Mt/s.
BenchCrypt Crypto SHA2-256 (GB/s)  32.2 [+2.2x]
 14.6  17.1  5.9 Like Ryzen, TR’s secret weapon is SHA HWA which allows it to soundly beat SKL-X over 2.2x faster!
BenchCrypt Crypto SHA1 (GB/s) 34.2 [+30%] 26.4  17.7  11.3 Even with SHA HWA, the multi-buffer AVX2 implementation allows SKL-X to beat TR by 16% but it still scores well.
BenchCrypt Crypto SHA2-512 (GB/s)  6.34 [-41%]  10.9  3.35  4.38 SHA2-512 is not accelerated by SHA HWA (version 1) thus TR has to use the same vectorised AVX2 code thus is 41% slower.
TR’s secret crypto weapon (as Ryzen) is SHA HWA which allows it to soundly beat SKL-X even with 33% less memory bandwidth; provided software is NUMA-enabled it seems TR can effectively use its 4-channel memory controllers.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 436 [+35%] 322  234.6  129 In this non-vectorised test TR bets SKL-X by 35%. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s)  366 [+32%]
277  198.6  109 Switching to FP64 code,TR still beats SKL-X by over 30%. So far so great.
BenchFinance Binomial float/FP32 (kOPT/s)  165 [+2.46x]
 67.3  85.6  27.25 Binomial uses thread shared data thus stresses the cache & memory system; we would expect TR to falter – but nothing of the sort – it is actually over 2.5x faster than SKL-X leaving it in the dust!
BenchFinance Binomial double/FP64 (kOPT/s)  83.7 [+27%]
 65.6  45.6  25.54 With FP64 code the situation changes somewhat – TR is only 27% faster but still an appreciable lead. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s)  91.6 [+42]
 64.3  49.1  25.92 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; TR reigns supreme being 40% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s)  68.7 [+34%]
 51.2  37.1  19 Switching to FP64, TR is just 34% faster but still a good lead
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads TR soundly beats SKL-X by a big margin that even a 16-core version may not be able to match. But should these tests be vectorisable using SIMD – especially AVX512 – then we would fully expect Intel to win. But for now – for financial workloads there is only one choice: TR/Ryzen!!!
BenchScience SGEMM (GFLOPS) float/FP32  165 [?] 623  240.7  268 We need to implement NUMA fixes here to allow TR to scale.
BenchScience DGEMM (GFLOPS) double/FP64  75.9 [?]  216  102.2  92.2 We need to implement NUMA fixes here to allow TR to scale.
BenchScience SFFT (GFLOPS) float/FP32  16.6 [-51%]  34.3  8.57  19 FFT is also heavily vectorised but stresses the memory sub-system more; here TR cannot beat SKL-X and is 50% slower – but scales well against Ryzen.
BenchScience DFFT (GFLOPS) double/FP64  8 [-65%]  23.18  7.6  11.13 With FP64 code, the gap only widens with TR over 65% slower than SKL-X and little scaling over Ryzen.
BenchScience SNBODY (GFLOPS) float/FP32  456 [-22%]  587  234  272 N-Body simulation is vectorised but has many memory accesses to shared data – and here TR is only 22% slower than SKL-X but again scales well vs Ryzen.
BenchScience DNBODY (GFLOPS) double/FP64  173 [-2%]  178  87.2  79.6 With FP64 code TR almost catches up with SKL-X
With highly vectorised SIMD code TR cannot do as well – but an additional issue is that NUMA support needs to be improved – F/D-GEMM shows how much of a problem this can be as all memory traffic is using a single NUMA node.
CPU Image Processing Blur (3×3) Filter (MPix/s)  1470 [-6%] 1560  775  634 In this vectorised integer AVX2 workload TR does surprisingly well against SKL-X just 6% slower.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  617 [-10%]  693  327  280 Same algorithm but more shared data used sees TR now 10%, more NUMA optimisations needed.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  361 [-6%]  384  192  154 Again same algorithm but even more data shared now TR is 6% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  570 [-6%]  609  307  271 Different algorithm but still AVX2 vectorised workload – TR is still 6% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  106 [+35%]  78.3  57.3  34.9 Still AVX2 vectorised code but TR does far better, it is no less than 35% faster than SKL-X!
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  37.8 [-17%]  45.8  20  18.1 TR does worst here, it is 17% slower than SKL-X but still scales well vs. Ryzen.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1260 [?]  4260  1160  2280 This 64-bit SIMD integer workload is a problem for TR but likely NUMA issue again as not much scaling vs. Ryzen.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 420 [-45%]  777  175  359 TR really does not do well here but does scale well vs. Ryzen, likely some code optimisation is needed.

As TR (like Ryzen) supports most modern instruction sets now (AVX2, FMA, AES/SHA HWA) it does well but generally not enough to beat SKL-X; unfortunately the latter with AVX512 can potentially get even faster (up to 100%) increasing the gap even more.

While we’ve not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) – even more memory bandwidth is needed to feed all the 16 cores (32 threads) and being able to run the memory at higher speeds would be appreciated.

NUMA support is crucial – as non-NUMA algorithms take a big hit (see GEMM) where performance can be even lower than Ryzen. While complex server or scientific software won’t have this problem, most programs will not be NUMA aware.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS)  111 [+88%]  59  61.5  29 .Net CLR integer performance starts off very well with TR 88% faster than SKL-X an incredible result! This is *not* a fluke as Ryzen scores incredibly too.
BenchDotNetAA .Net Dhrystone Long (GIPS) 62.9 [+3%]  61  41  29 TR cannot match the same gain with 64-bit integer, but still just about manages to beat SKL-X.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS)  193 [+82%]  106  103  50 Floating-Point CLR performance is pretty spectacular with TR (like Ryzen) dominating – it is no less than 82% faster than SKL-X!
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS)  225 [+67%]  134  111  63 FP64 performance is also great with TR 67% faster than SKL-X an absolutely huge win!
It’s pretty incredible, for .Net applications TR – like Ryzen – is king! It is pretty incredible that is is between 60-80% faster in all tests (except 64-bit integer). With more and more applications (apps?) running under the CLR, TR (like Ryzen) has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s)  195 [+38%]
141  92.6  53.4 In this non-vectorised test, TR is almost 40% faster than SKL-X not as high as what we’ve seen before but still significant.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s)  192 [+34%]
 143  97.6  56.5 With 64-bit integer workload this time we see no changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s)  626 [+27%]
 491  347  241 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Intel strikes back through its SIMD units but TR is a comfortably 27% faster than it.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s)  344 [+14%]
 301  192  135 Switching to FP64 SIMD vector code – still running AVX/FMA – TR’s lead falls to 14% but it is still a win!
Taking advantage of RyuJit’s support for vectors/SIMD (through SSE2, AVX/FMA) allows SKL-X to gain some traction – TR remains very much faster up to 40%. Whatever the workload, it seems TR just loves it.
Java Arithmetic Java Dhrystone Integer (GIPS)  1000 [+16%]  857 JVM integer performance is only 16% faster on TR than SKL-X – but a win is a win.
Java Arithmetic Java Dhrystone Long (GIPS)  974 [+26%]  771 With 64-bit integer workloads, TR is now 26% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS)  231 [+48%]  156 With a floating-point workload TR increases its lead to a massive 48%, a pretty incredible result.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS)  183 [+14%]  160 With FP64 workload the gap reduces way down to 14% but it is still faster than SKL-X.
Java performance is not as incredible as we’ve seen with .Net, but TR is still 15-50% faster than SKL-X – no mean feat! Again if you have Java workloads, then TR should be the CPU of choice.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s)  200 [+45%]  137 The JVM does not support SIMD/vectors, thus TR uses its scalar prowess to be 45% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s)  186 [+33%]  139 With 64-bit vectorised workload Ryzen is still 33% faster.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s)  169 [+69%]  100 With floating-point, TR is a massive 69% faster than SKL-X a pretty incredible result.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s)  159 [+59%]  100 With FP64 workload TR’s lead falls just a little to 59% – a huge win over SKL-X.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives TR (like Ryzen) free reign to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

TR (like Ryzen) absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than the latest Intel SKL-X – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on TR. For .Net and Java code, TR is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It may be difficult to decide whether AMD’s design (multiple CCX units, multiple dies/nodes on a socket) is “cool” and supporting it effectively is not easy for programmers – be they OS/kernel or application – but when it works it works extremely well! There is no doubt that Threadripper can beat Skylake-X at the same cost (approx 1,000$) though using more coress just as its little (single-die) brother Ryzen.

Scalar workloads, .Net/Java workloads just fly on it – but highly vectorised AVX2/FMA workloads only perform competitively; unfortunately once AVX512 support is added SKL-X is likely to dominate effectively these workloads though for now it’s early days.

It’s multiple NUMA node design – unless running in UMA (unified) mode – requires both OS and application support, otherwise performance can tank to Ryzen levels; while server and scientific programs are likely to be so – this is a problem for most applications. Then we have its dual-CCX design which further complicate workloads, effectively being a 2nd NUMA level; we can see inter-core latencies being 4 tiers while SKL-X only has 2 tiers.

In effect both platforms will get better in the future: Intel’s SKL-X with AVX512 support and AMD’s Threadripper with NUMA/CCX memory optimisations (and hopefully AVX512 support at one point). Intel are also already launching newer versions with more cores (up to 18C/36T) while AMD can release some server EPYC versions with 4 dies (and thus up to 32C/64T) that will both push power envelopes to the maximum.

For now, Threadripper is a return to form from AMD.

AMD Threadripper Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – can be enabled in the server (“EPYC”) designs – while current HEDT systems only use 2 – but AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels greatly increasing bandwidth over Ryzen, just as with Intel.

AMD Threadripper die

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
TLB 4kB pages
64 full-way
1536 8-way
64 8-way
1536 6-way
64 full-way
1536 8-way
64 8-way
1536 6-way
TR/Ryzen has comparatively “better” TLBs 8-way vs 6-way and full-way vs 8-way.
TLB 2MB pages
64 full-way
1536 2-way
8 full-way
1536 6-way
64 full-way
1536 2-way
8 full-way
1536 6-way
Nothing much changes for 2MB pages with TR/Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-3300 600-1200 800-4000 TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (Mhz) Max
2400 / 2666 2533 / 2400 2400 / 2666 2533 / 2400 TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X supports only up to 2400 officially but happily runs at 3200MHz a big advantage.
Memory Channels / Width
4 / 256-bit 4 / 256-bit 2 / 128-bit 2 / 128-bit Both TR and SKL-X enjoy 256-bit memory channels.
Memory Timing (clocks)
14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Despite faster memory, TR/Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth which we will test with presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively different CPUs; thus sharing data between cores on different nodes is equivalent to different CPUs in a SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s)  92.2 [+7%]  85.5  47.2  39.5 With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7% – assuming threads are scheduled correctly.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 7.51 [1/3]  24.4  5.75  16 In worst-case pairs on TR go not to just different CCX but NUMA nodes thus bandwidth is 1/3 that of SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns)  15.4 [-1%]
15.8  15.5  16.1 Within the same core (sharing L1D/L2) , TR/Ryzen inter-unit is ~15ns comparative with both Intel’s CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns)  46.4 [-36%]  72.3  44.3  45 Within the same compute unit (sharing L3), the latency is ~45ns is much lower than SKL-X
CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns)  184.7 [+4x]  135 Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Node(ns)  274.4 [+6x] Going inter-node increases the latency yet again by 6 times, thus scheduling is everything.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled – as latencies are much larger than inter-core; going off node increases latencies yet again but not by a lot; if anything inter-node interconnect seems pretty low latency comparatively.
Aggregated L1D Bandwidth (GB/s)  1372 [-40%] 2252  739  878 SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete but they would do better against older designs.
Aggregated L2 Bandwidth (GB/s)  990 [-2%]  1010  565  402 The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X.
Aggregated L3 Bandwidth (GB/s)  749 [+2.6x]
 289  300  247 The 4x L3 caches have much higher bandwidth than the single SKL-X cache.
Aggregated Memory (GB/s)  56 [-18%]  69  28  31 Running at lower memory speed TR cannot beat SKL-X but has comparatively higher memory efficiency
Even with 16x L1D and L2 caches, TR cannot match the much faster SKL-X 10x caches – that have been updated for 512-bit support but they are competitive; the 4x L3 caches do soundly beat the unified one on SKL-X but then again sharing data not within the same CCX is going to be very much slower.

At 2400Mt/s TR is running 33% slower than SKL-X at 3200Mt/s but its bandwidth is just 18% lower – thus its 4x DDR4 controllers are more efficient – not something we’re used to seeing.

Data In-Page Random Latency (ns)  72.8 [4-17-37] [+2.75x]  26.4 [4-13-33]  70.7 [4-17-37]  20 [4-12-21] What we saw previously with Ryzen was not accident; TR also suffers from surprisingly large in-page latency, almost 3x of Intel designs. Either the TLBs are very slow or not working.
Data Full Random Latency (ns)  111.5 [4-17-44] [+47%]  75.5 [4-13-70]  87.9 [4-17-37]  65 [4-12-34] Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X.
Data Sequential Latency (ns)  5.5 [4-7-8] [=]  5.4 [4-11-13]  3.8 [4-7-8]
 4.1 [4-12-13] TR’s prefetchers are working well with sequential access pattern latency at ~5ns matching SKL-X.
We finally discover an issue – TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge – almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential are competitive. It does point to something with the TLBs as to whether they do work or are just very much slower for some reason.
Code In-Page Random Latency (ns)  17.2 [4-10-26] [+43%] 12 [4-14-28]  16.1 [4-9-25]  10 [4-11-21] With code we don’t see the same problem – with in-page latency a bit higher than SKL-X (40%) but nowhere as high as what we saw before.
Code Full Random Latency (ns)  178 [4-15-60] [+2x]  86.1 [4-16-106]  95.4 [4-13-49]  70 [4-11-47] Out-of-page latency is a bit higher than SKL-X but not as bad as before.
Code Sequential Latency (ns)  8.7 [4-10-20] [+33%]  6.5 [4-7-12]  8.4 [4-9-18]  5.3 [4-9-20] Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns and thus 33% higher than SKL-X.
While code access latencies are higher than the new SKL-X – they are comparative with the older designs and not as bad as we’ve seen with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies – though bandwidth seems just great.
Memory Update Transactional (MTPS)  1.9 52.2 [HLE]  4.18  32.4 [HLE] SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it.
Memory Update Record Only (MTPS)  1.88  57.23 [HLE]  4.22  25.4 [HLE] We see a similar pattern here.
Without HLE/RTM TR (like Ryzen) don’t have much chance against SKL/X but considering support for it is disabled in most SKUs, there’s not much AMD has to be worried about – no to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs Intel will have a problem on its hands…

Threadripper’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals partly due to more cores and more caches (16 vs 10); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies which are far higher – about 3 times – TLB issues? We’ve been here before with Bulldozer which could not be easily fixed – but if AMD does manage it this time Ryzen’s performance will literally fly!

Still, despite this issue we’ve seen in the previous article that TR’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

TR’s memory performance is not the clean-sweep we’ve seen in CPU testing but it is competitive with Intel’s designs,and especially against older designs. The bandwidths are all competitive and especially the memory controllers seem to be more efficient – but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs unless you want absolutely tried and tested design which have already been patched by microcode and firmware/BIOS updates.