AMD Ryzen2 2700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen2” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen2” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions – with integrated “Vega RX” graphics – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we are testing them in this article!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen2 2700X Pinnacle Ridge | AMD Ryzen2 2600 Pinnacle Ridge | AMD Ryzen 1700X Summit Ridge | Intel i7-6700K SkyLake | Comments
L1D / L1I Caches | 8x 32kB 8-way / 8x 64kB 8-way | 6x 32kB 8-way / 6x 64kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | Ryzen2 data/instruction caches are unchanged; icache is still 2x as big as Intel’s.
L2 Caches | 8x 512kB 8-way | 6x 512kB 8-way | 8x 512kB 8-way | 4x 256kB 8-way | Ryzen2 L2 cache is unchanged but we’re told latencies have been improved. In total it is 4x bigger than Intel’s (4MB vs 1MB)!
L3 Caches | 2x 8MB 16-way | 2x 8MB 16-way | 2x 8MB 16-way | 8MB 16-way | Ryzen2 L3 caches are also unchanged – but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
TLB 4kB pages | 64 full-way / 1536 8-way | 64 full-way / 1536 8-way | 64 full-way / 1536 8-way | 64 8-way / 1536 6-way | No TLB changes.
TLB 2MB pages | 64 full-way / 1536 2-way | 64 full-way / 1536 2-way | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) | 600-1200 | 600-1200 | 600-1200 | 1200-4000 | Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max | 2400 / 2933 | 2400 / 2933 | 2400 / 2666 | 2533 / 2400 | Ryzen2 now supports up to 2933MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | All have 128-bit total channel width.
Memory Timing (clocks) | 14-16-16-32 7-54-18-9 2T | 14-16-16-32 7-54-18-9 2T | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | Memory runs at the same timings on both Ryzen2 and Ryzen1 but we shall see if measured latencies are different.
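The channel, width, speed and timing rows above are enough to work out the theoretical peak memory bandwidth and to convert a timing from clocks to nanoseconds. The following is a minimal Python sketch of that arithmetic, not something Sandra reports; it assumes the 128-bit total width from the table, that the memory clock is half the transfer rate, and that the first figure of each timing set is the CAS latency. The function names are illustrative only.

```python
def peak_bandwidth_gbs(transfer_rate_mts, total_width_bits=128):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = total_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


def timing_ns(clocks, transfer_rate_mts):
    """Convert a timing given in memory clocks to nanoseconds.

    DDR transfers twice per clock, so the memory clock (MHz) is half the
    transfer rate (MT/s).
    """
    memory_clock_mhz = transfer_rate_mts / 2
    return clocks / memory_clock_mhz * 1000


# Memory speeds as tested: Ryzen at 2400, the i7 at 2533.
ryzen_bw = peak_bandwidth_gbs(2400)
intel_bw = peak_bandwidth_gbs(2533)
ryzen_cas = timing_ns(14, 2400)   # first timing value, assumed to be CL
intel_cas = timing_ns(16, 2533)

print(f"Ryzen DDR4-2400: {ryzen_bw:.1f} GB/s peak, CAS {ryzen_cas:.1f} ns")
print(f"Intel DDR4-2533: {intel_bw:.1f} GB/s peak, CAS {intel_cas:.1f} ns")
```

Measured bandwidth will always sit below these peaks; the sketch only gives the ceiling against which the results later in the article can be judged.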

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

When running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX should ensure lower latencies and higher bandwidth, which we will test presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
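To make the three pair types concrete, here is a small Python sketch that classifies producer/consumer pairs on an 8C/16T part. It is not Sandra’s implementation: it assumes a hypothetical numbering in which SMT siblings are adjacent logical CPUs (0 and 1 form core 0) and the 8 cores are split evenly over 2 CCXes, which may not match how a given operating system enumerates them.

```python
from itertools import combinations

THREADS_PER_CORE = 2   # SMT, as on the 8C/16T parts
CORES_PER_CCX = 4      # assumption: 8 cores split evenly over 2 CCXes


def locate(logical_cpu):
    """Return (ccx, core) for a logical CPU index under the assumed numbering."""
    core = logical_cpu // THREADS_PER_CORE
    ccx = core // CORES_PER_CCX
    return ccx, core


def classify_pair(cpu_a, cpu_b):
    """Classify a producer/consumer pair by the resource the two units share."""
    ccx_a, core_a = locate(cpu_a)
    ccx_b, core_b = locate(cpu_b)
    if core_a == core_b:
        return "same core"        # shares L1D/L2
    if ccx_a == ccx_b:
        return "same CCX"         # shares L3
    return "different CCX"        # must cross the inter-CCX link


# Count how many of the possible pairs fall into each class.
counts = {}
for a, b in combinations(range(16), 2):
    kind = classify_pair(a, b)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Under these assumptions more than half of all possible pairs cross the CCX boundary, which is why the choice of matching algorithm (best vs worst) matters so much for the results.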

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | Ryzen2 2700X 8C/16T Pinnacle Ridge | Ryzen2 2600 6C/12T Pinnacle Ridge | Ryzen 1700X 8C/16T Summit Ridge | i7-6700K 4C/8T SkyLake | Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 54.9 [+15%] 46.5 47.8 39 Ryzen2 manages 15% higher bandwidth between its cores, slightly better than just 11% clock increase – signalling some improvements under the hood.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 5.89 [+2%] 5.53 5.8 16.3 In the worst case, pairs on Ryzen must go across CCXes – and with this link running at the same clock (1200MHz) on Ryzen2 we can only manage a 2% increase in bandwidth. This is why faster memory is needed.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.5 [-13%] 15.4 15.6 16.2 Within the same core (sharing L1D/L2), Ryzen2 manages a 13% reduction in latency, again better than just clock speed increase.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 40.1 [-7%] 43.5 43.2 47.3 Within the same compute unit (sharing L3), the latency decreased by 7% on Ryzen2 thus L3 seems to have improved also.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 128 [-6%] 132 236 Going inter-CCX we still see a 6% reduction in latency on Ryzen2 – with the CCX link at the same speed – a welcome surprise.
The multiple CCX design still presents some challenges to programmers requiring threads to be carefully scheduled – but we see a decent 6-7% reduction in L3/CCX latencies on Ryzen2 even when running at the same clock as Ryzen1.
Aggregated L1D Bandwidth (GB/s) 862 [+18%] 615 730 837 Right off we see an 18% bandwidth increase – almost 2x the 11% clock increase – thus some improvements have been made to the cache system. It allows Ryzen2 to finally beat the i7 with its wide L1 data paths (512-bit), though with 2x more caches (8 vs 4).
Aggregated L2 Bandwidth (GB/s) 736 [+32%] 542 556 329 We see a huge 32% increase in L2 cache bandwidth – almost 3x clock increase (the 11%) suggesting the L2 caches have been improved also. Ryzen2 has thus 2x the L2 bandwidth of i7 though with 2x more caches (8 vs 4).
Aggregated L3 Bandwidth (GB/s) 339 [+19%] 398 284 238 The bandwidth of the L3 caches has also increased by 19% (almost 2x the clock increase), though we see the 6-core 2600 doing better (398 vs 339), likely due to fewer threads competing for the same L3 caches (12 vs 16). Ryzen2 L3 caches are not just 2x bigger than Intel’s but also deliver about 40% more bandwidth (339 vs 238).
Aggregated Memory (GB/s) 30.2 [+2%] 30.2 29.6 29.1 With the same memory clock, Ryzen2 does still manage a small 2% improvement – signalling memory controller improvements. We also see Ryzen’s memory at 2400MT/s having better bandwidth than Intel at 2533MT/s.
We see big improvements on Ryzen2 for all caches L1D/L2/L3 of roughly 20-30% – more than just the raw clock increase (11%) – so AMD has indeed made improvements – which to be fair needed to be done. The memory controller is also a bit more efficient (2%) though it can run at higher clocks than tested (2400MT/s) – hopefully fast DDR4 memory will become more affordable.
Data In-Page Random Latency (ns) 66.4 (4-12-31) [-6%] [0][-5][-4] 66.4 (4-12-31) 70.5 (4-17-35) 20.4 (4-12-21) In-page latency has decreased by a noticeable 6% on Ryzen2 (both 2700X and 2600) – we see a 5-clock reduction for L2 and 4 for L3, a welcome improvement. But there is still a way to go to catch Intel, which has about 1/3 of the latency.
Data Full Random Latency (ns) 80.9 (4-12-32) [-8%] [0][-5][-4] 79.4 (4-12-32) 87.6 (4-17-36) 63.9 (4-12-34) Out-of-page latencies have also been reduced by 8% on Ryzen2 (same memory) and we see the same 5 and 4 clock reduction for L2 and L3 (on both 2700X and 2600 it’s no fluke). Again these are welcome but still have a way to go to catch Intel.
Data Sequential Latency (ns) 3.4 (4-6-7) [-8%] [0][-1][0] 3.5 (4-6-7) 3.7 (4-7-7) 4.1 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% latency drop for Ryzen2.
Ryzen1’s issue was high memory latencies (in-page/full random) and Ryzen2 has reduced them all by 6-8%. While it is a good improvement, they are still pretty high compared to Intel’s thus more work needs to be done here.
Code In-Page Random Latency (ns) 14.2 (4-9-24) [-9%] [0][0][0] 14.6 (4-9-24) 15.6 (4-9-24) 10.1 (2-10-21) Code latencies were not a problem on Ryzen1 but we still see a welcome reduction of 9% on Ryzen2. (no clocks delta)
Code Full Random Latency (ns) 88.6 (4-14-49) [-9%] [0][+1][+2] 89.3 (4-14-49) 97.4 (4-13-47) 70.7 (2-11-46) Out-of-page latency also sees a 9% decrease on Ryzen2 but somewhat surprisingly a 1-2 clock increase.
Code Sequential Latency (ns) 7.6 (4-12-20) [-8%] [0][+1][+1] 7.8 (4-12-20) 8.3 (4-11-19) 5.0 (2-4-9) Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% reduction on Ryzen2.
While code access latencies were not a problem on Ryzen1, they also see an 8% improvement on Ryzen2, which is welcome. Note the L1I code cache is 2x Intel’s (64kB vs 32kB).
Memory Update Transactional (MTPS) 4.7 [+10%] 5 4.28 33.2 HLE Ryzen2 is 10% faster than Ryzen1 but naturally without HLE support it cannot match the i7. But with Intel disabling HLE on all but top-end CPUs, AMD does not have much to worry about.
Memory Update Record Only (MTPS) 4.6 [+11%] 4.75 4.16 23 HLE With only record updates we still see an 11% increase.

Ryzen2 brings nice updates – good bandwidth increases to all caches L1D/L2/L3 and also much-needed latency reductions for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to push Intel into 2nd place for a good while.
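Several comments above note that the 2700X reaches its aggregate cache bandwidth with twice as many caches as the i7. Dividing the aggregated figures by the core count gives a rough per-core view; the Python sketch below does just that using the values from the table. This is an illustrative normalisation rather than a Sandra output, and since L3 is shared per CCX rather than per core, its per-core figure is only indicative.

```python
# Aggregated bandwidth (GB/s) from the table above and core counts from the specs.
cpus = {
    "Ryzen2 2700X": {"cores": 8, "L1D": 862, "L2": 736, "L3": 339},
    "Ryzen2 2600":  {"cores": 6, "L1D": 615, "L2": 542, "L3": 398},
    "Ryzen 1700X":  {"cores": 8, "L1D": 730, "L2": 556, "L3": 284},
    "i7-6700K":     {"cores": 4, "L1D": 837, "L2": 329, "L3": 238},
}


def per_core(cpu, level):
    """Aggregated bandwidth divided evenly over the cores (GB/s per core)."""
    return cpus[cpu][level] / cpus[cpu]["cores"]


for name in cpus:
    row = ", ".join(f"{lvl} {per_core(name, lvl):6.1f}" for lvl in ("L1D", "L2", "L3"))
    print(f"{name:13s} {row}")
```

Seen per core, the i7 still has the wider L1D path, while Ryzen2 leads on L2; the aggregate wins come from core count as much as from the cache improvements.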

At the high-end, ThreadRipper2 will likely benefit most as it is going up against the many-core, AVX512-enabled SKL-X competitor, which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

As with original Ryzen, the cache and memory system performance is not the clean-sweep we’ve seen in CPU testing – but Ryzen2 does bring welcome improvements in bandwidth and latency – which hopefully will further improve with firmware/BIOS updates (AGESA firmware).

With the potential to use faster DDR4 memory – Ryzen2 can do far better than in this test (e.g. with 2933/3200MHz memory). Unfortunately at this time DDR4 memory – especially high-end fast versions – is hideously expensive, which is a bit of a problem. You may be better off using less but faster memory with Ryzen designs.
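As a rough upper bound on what faster memory could bring, the Python sketch below scales the tested 2400 speed to 2933 and 3200. It assumes bandwidth grows linearly with the transfer rate, which real systems will not fully achieve, and that the memory clock (and with it the inter-CCX link described earlier) is half the transfer rate. The function names are illustrative.

```python
TESTED_MTS = 2400  # memory speed used in this test


def max_bandwidth_gain(target_mts, tested_mts=TESTED_MTS):
    """Upper bound on the bandwidth gain if only the transfer rate changes."""
    return target_mts / tested_mts - 1


def memory_clock_mhz(transfer_rate_mts):
    """DDR transfers twice per clock, so MCLK is half the transfer rate."""
    return transfer_rate_mts / 2


for speed in (2933, 3200):
    print(f"DDR4-{speed}: MCLK {memory_clock_mhz(speed):.1f}MHz, "
          f"up to {max_bandwidth_gain(speed):+.0%} bandwidth vs DDR4-{TESTED_MTS}")
```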

Ryzen2 is a great update that will not disappoint upgraders and is likely to increase AMD’s market share. AMD is here to stay!

AMD Ryzen2 2700X Review & Benchmarks – CPU 8-core Performance

What is “Ryzen2” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen2” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions – with integrated “Vega RX” graphics – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen2 2700X Pinnacle Ridge | AMD Ryzen2 2600 Pinnacle Ridge | AMD Ryzen 1700X Summit Ridge | Intel i7-6700K SkyLake | Comments
Cores (CU) / Threads (SP) | 8C / 16T | 6C / 12T | 8C / 16T | 4C / 8T | Ryzen2, like its predecessor, has the most cores and threads; it will thus be down to IPC and clock speeds for performance improvements.
Speed (Min / Max / Turbo) | 2.2-3.7-4.2GHz (22x-37x-42x) [+9% rated, +11% turbo] | 1.55-3.4-3.9GHz (15x-34x-39x) | 2.2-3.4-3.8GHz (22x-34x-38x) | 0.8-4.0-4.2GHz (8x-40x-42x) | Ryzen2 base clock is 9% higher while Turbo/Boost/XFR is 11% higher; we thus expect at least about 10% improvement in CPU benchmarks.
Power (TDP) | 105W | 65W | 95W | 91W | Ryzen2 also increases TDP by 11% (105W vs 95W) which may require a bit more cooling especially when overclocking.
L1D / L1I Caches | 8x 32kB 8-way / 8x 64kB 8-way | 6x 32kB 8-way / 6x 64kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | Ryzen2 data/instruction caches are unchanged; icache is still 2x as big as Intel’s.
L2 Caches | 8x 512kB 8-way | 6x 512kB 8-way | 8x 512kB 8-way | 4x 256kB 8-way | Ryzen2 L2 cache is unchanged but we’re told latencies have been improved. In total it is 4x bigger than Intel’s (4MB vs 1MB).
L3 Caches | 2x 8MB 16-way | 2x 8MB 16-way | 2x 8MB 16-way | 8MB 16-way | Ryzen2 L3 caches are also unchanged – but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
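The percentages quoted in the Speed and Power rows can be checked with a few lines of Python. This is only a sketch of the arithmetic; the 1700X rated clock is taken as 3.4GHz, in line with its 34x multiplier.

```python
def uplift(new, old):
    """Relative increase of new over old, as a fraction."""
    return new / old - 1


# 2700X vs 1700X figures from the table above.
rated = uplift(3.7, 3.4)   # rated (base) clock in GHz
turbo = uplift(4.2, 3.8)   # turbo clock in GHz
tdp = uplift(105, 95)      # TDP in W

print(f"rated {rated:+.1%}, turbo {turbo:+.1%}, TDP {tdp:+.1%}")
```

Note that the turbo clock and the TDP rise by the same fraction, so on paper the turbo clock per watt of TDP is unchanged between the two generations.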

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | Ryzen2 2700X 8C/16T Pinnacle Ridge | Ryzen2 2600 6C/12T Pinnacle Ridge | Ryzen 1700X 8C/16T Summit Ridge | i7-6700K 4C/8T Skylake | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 323 [+8%] 236 298 194 Right off, Ryzen2 is 8% faster than Ryzen1; let’s hope it does better. Even the 2600 beats the i7 easily.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 337 [+12%] 238 301 194 With a 64-bit integer workload – we finally get into gear, Ryzen2 is 12% faster than its old brother.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 204 [+12%] 144 182 107 Even in this floating-point test, Ryzen2 is again 12% faster. All AMD CPUs beat the i7 into dust.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 172 [+11%] 123 155 89 With FP64 nothing much changes, Ryzen2 is still 11% faster.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, Ryzen2 is about 10% faster than Ryzen1: this is exactly in line with the speed increase (9-11%) but if you were expecting more you may be a tiny bit disappointed.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 619 [+16%] 428 535 510 In this vectorised AVX2 integer test Ryzen2 starts to pull ahead and is 16% faster than Ryzen1; perhaps some of the arch improvements benefit SIMD vectorised workloads.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 187 [+10%] 132 170 197 With a 64-bit AVX2 integer vectorised workload, Ryzen2 drops to just 10% but still in line with speed increase.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 5.83 [+7%] 4.12 5.47 3 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 drops to just 7% faster than Ryzen1 but still a decent improvement.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 577 [+11%] 409 520 453 In this floating-point AVX/FMA vectorised test, Ryzen2 is the standard 11% faster than Ryzen1.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 332 [+11%] 236 299 267 Switching to FP64 SIMD code, again Ryzen2 is just the standard 11% faster than Ryzen1.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 15.6 [+15%] 11 13.7 11 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen2 manages to pull ahead further and is 15% faster.
In vectorised AVX2/FMA code we see a similar story with 10% average improvement (7-15%). It seems the SIMD units are unchanged. In any case the i7 is left in the dust.
BenchCrypt Crypto AES-256 (GB/s) 14.1 [+1%] 14.1 13.9 14.7 With AES HWA support all CPUs are memory bandwidth bound; as we’re testing Ryzen2 running at the same memory speed/timings there is still a very small improvement of 1%. But its advantage is that the memory controller is rated for 2933Mt/s operation (vs. 2533) thus with faster memory it could run considerably faster.
BenchCrypt Crypto AES-128 (GB/s) 14.2 [+1%] 14.2 14 14.8 What we saw with AES-256 just repeats with AES-128; Ryzen2 is marginally faster but the improvement is there.
BenchCrypt Crypto SHA2-256 (GB/s) 18.4 [+12%] 13.2 16.5 5.9 With SHA HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust; SHA is still memory bound but with just one (1) buffer it has larger headroom. Thus Ryzen2 can use its speed advantage and be 12% faster – impressive.
BenchCrypt Crypto SHA1 (GB/s) 19.2 [+14%] 13.1 16.8 11.3 Ryzen also accelerates the soon-to-be-defunct SHA1 and here it is even faster – 14% faster than Ryzen1.
BenchCrypt Crypto SHA2-512 (GB/s) 3.75 [+12%] 2.66 3.34 4.4 SHA2-512 is not accelerated by SHA HWA (version 1) thus Ryzen has to use the same vectorised AVX2 code path – it still is 12% faster than Ryzen1 but still loses to the i7. Those SIMD units are tough to beat.
In memory bandwidth bound algorithms, Ryzen2 will have to be used with faster memory (up to 2933Mt/s officially) in order to significantly beat its older Ryzen1 brother. Otherwise there is only a tiny 1% improvement.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 260 [+11%] 184 235 126 In this non-vectorised test we see Ryzen2 is the standard 11% faster than Ryzen1.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 221 [+11%] 157 199 112 Switching to FP64 code, nothing changes, Ryzen2 is still 11% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 106 [+23%] 76 86 27 Binomial uses thread shared data thus stresses the cache & memory system; here the arch(itecture) improvements do show, with Ryzen2 23% faster – 2x more than expected. Not to mention over 3x (three times) faster than the i7.
BenchFinance Binomial double/FP64 (kOPT/s) 60.8 [+28%] 43.2 47.5 29.2 With FP64 code Ryzen2 is now even faster – 28% faster than Ryzen1 not to mention 2x faster than the i7. Indeed it seems there are improvements to the cache and memory system.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.4 [+11%] 38.6 49.2 49.2 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen2 does not seem to be able to reproduce its previous gain and is just the standard 11% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 41.2 [+10%] 29.1 37.3 20.3 Switching to FP64 nothing much changes, Ryzen2 is 10% faster.
Ryzen1 does very well in these algorithms, but Ryzen2 does even better – especially when thread-shared data is involved, managing a 23-28% improvement. For financial workloads Intel does not seem to have a chance anymore – Ryzen is impossible to beat.
BenchScience SGEMM (GFLOPS) float/FP32 275 [+10%] 238 250 267 In this tough vectorised AVX2/FMA algorithm Ryzen2 is still “just” 10% faster than the older Ryzen1 – but it finally manages to beat the i7.
BenchScience DGEMM (GFLOPS) double/FP64 113 [+4%] 103 109 116 With FP64 vectorised code, Ryzen2 only manages to be 4% faster. It seems the memory is holding it back thus faster memory would allow it to do much better.
BenchScience SFFT (GFLOPS) float/FP32 8.56 [+4%] 7.36 8.2 19.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; Ryzen2 is just 4% faster again and is still 1/2x the speed of the i7. Again it seems faster memory would help.
BenchScience DFFT (GFLOPS) double/FP64 7.42 [+1%] 6.87 7.32 9.19 With FP64 code, Ryzen2’s improvement reduces to just 1% over Ryzen1 and again slower than the i7.
BenchScience SNBODY (GFLOPS) float/FP32 279 [+12%] 197 249 269 N-Body simulation is vectorised but makes many memory accesses to shared data, and Ryzen2 gets back to a 12% improvement over Ryzen1. This allows it to finally overtake the i7.
BenchScience DNBODY (GFLOPS) double/FP64 114 [+13%] 80 101 79 With FP64 code nothing much changes, Ryzen2 is still 13% faster.
With highly vectorised SIMD code Ryzen2 still improves by the standard 10-12% but in memory-heavy code it needs to run at higher memory speed to significantly overtake Ryzen1. But it allows it to beat the i7 in more algorithms.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1290 [+11%] 913 1160 1170 In this vectorised integer AVX2 workload Ryzen2 is 11% faster allowing it to soundly beat the i7.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 551 [+11%] 391 497 435 Same algorithm but more shared data does not change things for Ryzen2. Only the i7 falls behind.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 307 [+11%] 218 276 233 Again same algorithm but even more data shared does not change anything, but now the i7 is so far behind that Ryzen2 is over 30% faster (307 vs 233). Incredible.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 461 [+11%] 326 415 384 Different algorithm but still AVX2 vectorised workload still changes nothing – Ryzen2 is 11% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 69.7 [+12%] 49.7 62 38 Still AVX2 vectorised code and still nothing changes; the i7 falls even further behind with Ryzen2 2x (two times) as fast.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 24.7 [+11%] 17.5 22.3 20 Again we see Ryzen2 11% faster than the older Ryzen1 and pulling away from the i7.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1460 [+8%] 1130 1350 1670 Here Ryzen2 is just 8% faster than Ryzen1 but strangely it’s not enough to beat the i7. Those SIMD units are way fast.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 243 [+11%] 172 219 268 In this final test, Ryzen2 returns to being 11% faster and again strangely not enough to beat the i7.

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA) Ryzen2 does extremely well in all workloads – but it generally improves only by the 11% as per clock speed increase, except in some cases which seem to show improvements in the cache and memory system (which we have not tested yet).
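One way to separate plain clock scaling from the cases that hint at cache and memory improvements is to compare each measured gain with the turbo clock gain. The Python sketch below does this for a handful of results from the table; the 3-point tolerance band is an arbitrary assumption chosen for illustration, not a threshold used by Sandra.

```python
# Turbo clocks (GHz) from the specification table: 2700X vs 1700X.
TURBO_GAIN = 4.2 / 3.8 - 1   # about +11%

# (Ryzen2 2700X, Ryzen 1700X) scores taken from the results above.
results = {
    "Dhrystone Integer (GIPS)": (323, 298),
    "AES-256 (GB/s)": (14.1, 13.9),
    "Binomial FP32 (kOPT/s)": (106, 86),
    "Binomial FP64 (kOPT/s)": (60.8, 47.5),
    "DGEMM (GFLOPS)": (113, 109),
    "Blur 3x3 (MPix/s)": (1290, 1160),
}

TOLERANCE = 0.03  # assumption: within 3 points of the clock gain is "in line"


def classify(new, old):
    """Compare a measured gain with the turbo clock gain."""
    gain = new / old - 1
    if gain > TURBO_GAIN + TOLERANCE:
        verdict = "above clock scaling"
    elif gain < TURBO_GAIN - TOLERANCE:
        verdict = "below clock scaling"
    else:
        verdict = "in line with clock"
    return gain, verdict


for name, (new, old) in results.items():
    gain, verdict = classify(new, old)
    print(f"{name:26s} {gain:+6.1%}  {verdict}")
```

With these inputs the Binomial tests land above clock scaling, the memory-bound AES and DGEMM tests land below it, and the rest track the clock increase.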

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest drivers. .Net 4.7.x (RyuJit), Java 1.9.x. Turbo / Boost was enabled on all configurations.

VM Benchmarks | Ryzen2 2700X 8C/16T Pinnacle Ridge | Ryzen2 2600 6C/12T Pinnacle Ridge | Ryzen 1700X 8C/16T Summit Ridge | i7-6700K 4C/8T Skylake | Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 63.2 [+8%] 30 58.6 26 .Net CLR integer performance starts off OK with Ryzen2 just 8% faster than Ryzen1 but now almost 2.5x faster than the i7.
BenchDotNetAA .Net Dhrystone Long (GIPS) 49.6 [+20%] 33.6 41.2 27 Ryzen seems to favour 64-bit integer workloads, with Ryzen2 20% faster, a lot higher than expected.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 104 [+15%] 71.2 90.5 54.3 Floating-Point CLR performance was pretty spectacular with Ryzen already, but Ryzen2 is still 15% faster than Ryzen1.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 122 [+20%] 88.2 102 65.6 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen2 even faster by 20%.
Ryzen1’s performance in .Net was pretty incredible but Ryzen2 is even faster – even faster than expected by mere clock speed increase. There is only one game in town now for .Net applications.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 106 [+9%] 74 97 54 Just as we saw with Dhrystone, this integer workload sees a 9% improvement for Ryzen2 which makes it 2x faster than the i7.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 111 [+8%] 78 103 57 With 64-bit integer workload we see a similar story – Ryzen2 is 8% faster and again 2x faster than the i7.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 387 [+11%] 278 348 240 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Ryzen2 is 11% faster and still 1.6x faster than the i7 despite the latter’s fast SIMD units.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 217 [+12%] 153 194 48.6 Switching to FP64 SIMD vector code – still running AVX/FMA – Ryzen2 is still 12% faster. The i7 is truly left in the dust at less than 1/4 of the speed.
Ryzen2 is the usual 9-12% faster than Ryzen1 here but it means that even RyuJit’s SIMD support cannot save Intel’s i7 – it would take 2x as many cores (not 50%) to beat Ryzen2.
Java Arithmetic Java Dhrystone Integer (GIPS) 574 [+12%] 399 514 We start JVM integer performance with the usual 12% gain over Ryzen1.
Java Arithmetic Java Dhrystone Long (GIPS) 559 [+12%] 392 500 Nothing much changes with 64-bit integer workload, we have Ryzen2 12% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 138 [+13%] 99 122 With a floating-point workload Ryzen2 performance improvement is 13%.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 137 [+7%] 97 128 With FP64 workload Ryzen2 is just 7% faster but still welcome.
Java performance improves by the expected amount 7-13% on Ryzen2 and allows it to completely dominate the i7.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 108 [+15%] 76 94 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but here Ryzen2 manages a 15% lead over Ryzen1.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 114 [+24%] 73 92 With 64-bit vectorised workload Ryzen2 (similar to .Net) increases its lead to 24%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 99 [+14%] 69 87 Switching to floating-point we return to the usual 14% speed improvement.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+1%] 64 92 With FP64 workload Ryzen2’s lead somewhat inexplicably drops to 1%.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (e.g. SSE2, AVX/FMA) gives Ryzen2 free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

Ryzen1 dominated the .Net and Java benchmarks – but now Ryzen2 extends that dominance out-of-reach. It would take a very much improved run-time or Intel CPU to get anywhere close. For .Net and Java code, Ryzen is the CPU to get!
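To condense the .Net and Java results into a single figure each, one option is the geometric mean of the Ryzen2/Ryzen1 ratios, the usual way of averaging speedups. The Python sketch below applies it to the 2700X and 1700X scores from the tables above; it is our illustration only and not how Sandra computes its own aggregate scores.

```python
import math

# (Ryzen2 2700X, Ryzen 1700X) scores from the tables above.
dotnet = [(63.2, 58.6), (49.6, 41.2), (104, 90.5), (122, 102),
          (106, 97), (111, 103), (387, 348), (217, 194)]
java = [(574, 514), (559, 500), (138, 122), (137, 128),
        (108, 94), (114, 92), (99, 87), (93, 92)]


def geomean_speedup(pairs):
    """Geometric mean of new/old ratios across a set of benchmarks."""
    log_sum = sum(math.log(new / old) for new, old in pairs)
    return math.exp(log_sum / len(pairs))


print(f".Net geomean speedup: {geomean_speedup(dotnet):.3f}x")
print(f"Java geomean speedup: {geomean_speedup(java):.3f}x")
```

The resulting figures can then be compared directly with the 11% turbo clock increase.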

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen2 is a worthy update but its speed increase is generally due to its faster clock speed – similar to Intel’s SkyLake > KabyLake (gen 6 to gen 7) transition. But coming at the same price, a “free” performance increase of 10% or so is obviously not to be ignored. Let’s not forget that Ryzen2 can still use all the existing series 300 mainboards – subject to BIOS update.

The process shrink and power optimisations do allow Ryzen2 to run at lower voltages and consume less power – even though TDP has increased at least “on paper”.

Some algorithms do seem to show that the cache and memory system has been improved – but Ryzen2’s advantage is that it can use (much) faster memory. Unfortunately at this time DDR4 memory, especially fast versions, is very expensive. Here Intel does (still) have an advantage in that fast DDR4 memory is not required except for bandwidth bound algorithms.

One advantage is that by now operating systems (and applications) have been updated to deal with its dual-CCX design that used to be so much trouble when we benchmarked Ryzen1 initially. With AMD increasing its market share no high-performance application can afford to ignore AMD CPUs.

We (just) cannot wait to see the new improvements in future AMD designs and especially the ThreadRipper2 update!

AMD Ryzen 5 Series Launch & Reviews

Today marks the day AMD’s latest Ryzen 5 series launches (6C/12T 1600X, 1600 and 4C/8T 1500X, 1400) and the reviews – including Sandra benchmarks – have hit the web.

Congratulations to AMD on a great product, and look forward to our review of Ryzen 5 here too!

AMD Ryzen 1700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017) replacing the previous “Vishera”/”Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later (Ryzen2) and likely include a few improvements as well.

This is the “make or break” CPU for AMD and thus greatly improves performance, including much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). Thus there are no longer “core modules” but proper “cores with 2 SMT threads” so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

No new chipsets have been introduced – thus Ryzen should work with current 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update – making it a great upgrade.

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications | AMD Ryzen 1700X | Intel 6700K (Skylake) | Intel 5820K (Haswell-E) | Comments
TLB 4kB pages | 64 full-way / 1536 8-way | 64 8-way / 1536 6-way | 64 4-way / 1024 8-way | Ryzen has comparatively ‘better’ TLBs than even SKL while the 2-generation older HSW-E is showing its age.
TLB 2MB pages | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | 8 full-way / 1024 8-way | Nothing much changes for 2MB pages with Ryzen leading the pack again.
Memory Controller Speed (MHz) | 600-1200 | 800-4000 | 1200-4000 | Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max | 2400 / 2666 | 2533 / 2400 | 2133 / 2133 | Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL supports only up to 2400 officially but happily runs at 2533MHz; old HSW-E can only do 2133MHz but with 4 memory channels.
Memory Channels / Width | 2 / 128-bit | 2 / 128-bit | 4 / 256-bit | HSW-E leads with 4 memory channels of DDR4 providing massive bandwidth for its cores; however both Ryzen and Skylake can use faster DDR4 memory reducing this problem somewhat.
Memory Timing (clocks) | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | 14-15-15-36 4-51-16-3 2T | Despite faster memory Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

For algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX should ensure lower latencies and higher bandwidth, which we will test presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
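The three producer/consumer combinations (same core, same CCX, different CCX) can be illustrated with a small classifier. This is a minimal sketch and not Sandra's implementation: it assumes logical CPUs are numbered so that SMT siblings are adjacent (0/1 on core 0, 2/3 on core 1, and so on) and that cores 0-3 sit on the first CCX and cores 4-7 on the second. Real enumeration order depends on the OS and firmware.

```python
from itertools import combinations

CORES_PER_CCX = 4      # two CCXes of 4 cores each on the 1700X
THREADS_PER_CORE = 2   # SMT: 2 threads per core


def locate(logical_cpu: int) -> tuple[int, int]:
    """Return (core, ccx) for a logical CPU under the assumed numbering."""
    core = logical_cpu // THREADS_PER_CORE
    ccx = core // CORES_PER_CCX
    return core, ccx


def classify(a: int, b: int) -> str:
    """Classify a producer/consumer pair of logical CPUs."""
    core_a, ccx_a = locate(a)
    core_b, ccx_b = locate(b)
    if core_a == core_b:
        return "same core"
    if ccx_a == ccx_b:
        return "same CCX"
    return "different CCX"


def count_pairs(n_logical: int = 16) -> dict[str, int]:
    """Count how many of all possible pairs fall into each category."""
    counts: dict[str, int] = {}
    for a, b in combinations(range(n_logical), 2):
        kind = classify(a, b)
        counts[kind] = counts.get(kind, 0) + 1
    return counts


print(count_pairs())
```

Under these assumptions more than half of all possible pairings cross the CCX boundary, which is why the choice of matching algorithm matters.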

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 47.7 [+3%] 39 46 With 8 cores (and thus 8 pairs) Ryzen’s bandwidth matches the 6-core HSW-E and is 22% higher than SKL thus decent.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 13 [-23%] 16 17 In worst-case pairs on Ryzen must go across CCXes while on Intel CPUs they can still use L3 cache to exchange data – thus it ends up about 19% slower than SKL and 23% slower than HSW-E but it is not catastrophic.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.7 [-2%] 16 13.4 Within the same core (sharing L1D/L2), Ryzen’s inter-unit latency is ~15ns, comparable with both Intel CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 45 [-8%] 49 58 Within the same compute unit (sharing L3), the latency is ~45ns, a bit lower than SKL and much lower than HSW-E thus so far so good!
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 [+3x] Going inter-CCX increases the latency by 3 times to about 130ns thus if the threads are not properly scheduled you can end up with far higher latency and far less bandwidth.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled to avoid inter-CCX transfers where possible. As the CCX link runs at UMC speed using faster memory increases link bandwidth and decreases its latency which helps no end.
Aggregated L1D Bandwidth (GB/s) 727 [-17%] 878 1150 SKL has 512-bit data ports so Ryzen cannot compete with that (17% lower), and the 6-core HSW-E is further ahead still.
Aggregated L2 Bandwidth (GB/s) 557 [+38%] 402 500 The 8 L2 caches have competitive bandwidth: overall Ryzen is 38% ahead of SKL, though HSW-E does well too.
Aggregated L3 Bandwidth (GB/s) 392 [+58%] 247 205 Even spread over the 2 CCXes the L3 caches have huge aggregated bandwidth – 58% higher than SKL.
Aggregated Memory (GB/s) 28.5 [-8%] 31 42.5 Running at lower memory speed Ryzen cannot beat SKL, nor HSW-E with its 4 memory channels, but has higher comparative efficiency.
The 8x L2 caches and 2x L3 caches have higher aggregated bandwidth than either Intel CPU while the memory controller is also more efficient though it cannot compete with 4-channel HSW-E. But its 8x L1D caches are not “wide enough” to compete with SKL’s wide data ports (already widened in HSW). This may be one reason SIMD performance is not as high with Ryzen and AMD may have to widen them going forward especially when adding AVX512 support.
Data In-Page Random Latency (ns) 74 [+2.9x] (4-17-36) 20 (4-12-21) 25.3 (4-12-26) In-page latency is surprisingly large, almost 3x old HSW-E and ~3.7x SKL! Ryzen’s TLBs seem slow. L1 and L2 latencies are comparable (4 and 17 clk) but L3 latency is already ~40% higher than HSW-E and ~70% higher than SKL.
Data Full Random Latency (ns) 95 [+31%] (4-17-37) 65 (4-12-34) 72 (4-13-52) Out-of-page latencies are ‘better’ with Ryzen ‘only’ ~30% slower than HSW-E and about 45% slower than SKL. Again L1 and L2 are fine (4 and 17 clk) and L3 is comparable to SKL (37 vs 34 clk) while old HSW-E trails (52 clk)!
Data Sequential Latency (ns) 4.2 [+1%] (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency at ~4ns matching SKL and much better than old HSW-E.
We finally discover an issue – Ryzen’s memory latencies (in-page) are very high compared to the competition – a TLB issue? Fortunately sequential and out-of-page performance is fine so perhaps its memory prefetchers can alleviate the problem somewhat but it is something that will need to be addressed.
Code In-Page Random Latency (ns) 16.6 [+5%] (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code we don’t see the same problem – with in-page latency matching HSW-E, though still about 66% higher than SKL. The twice as large (64kB) L1I cache seems to have the same (4clk) latency as SKL’s. No issues with L2 nor L3 latencies either.
Code Full Random Latency (ns) 102 [+20%] (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page latency is a bit higher than both Intel CPUs, ~50% higher than SKL but nothing as bad as we’ve seen with data.
Code Sequential Latency (ns) 8.9 [-12%] (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns comparable to HSW-E but again about 67% higher than SKL.
While code access latencies are higher than the new SKL – they are comparable with the older HSW-E and nowhere near as bad as what we’ve seen with data. Even the twice as large L1I (L1 instruction cache) behaves itself with 4clk latency similar to Intel’s smaller L1I versions. It is thus a mystery why data is affected but not code.
Memory Update Transactional (MTPS) 4.23 [-39%] 32.4 HLE 7 SKL is in a world of its own due to support for HLE/RTM but Ryzen is still about 40% slower than HSW-E with just 6 cores.
Memory Update Record Only (MTPS) 4.19 [-23%] 25.4 HLE 5.47 With only record updates the difference drops to about 23% but again HLE shows its power for transaction processing.
Without HLE/RTM Ryzen is not going to win against Intel’s latest – but then again HLE/RTM are disabled in all but the very top-end CPUs – not to mention killed in previous HSW and BRW architectures so it is not a big problem. But if future models were to enable it, Intel will have a big problem on its hands…

Ryzen’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals partly due to more cores and more caches (8 vs 6 or 4); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies which are far higher – about 3 times – a TLB issue? We’ve been here before with Bulldozer, where it could not be easily fixed – but if AMD does manage it this time Ryzen’s performance will fly!
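The "about 3 times" figure comes straight from the in-page latency row of the table. As a quick check, the small sketch below recomputes the ratios from the measured values in nanoseconds; the helper name is illustrative only.

```python
# Data in-page random latency (ns), copied from the table above
in_page_ns = {
    "Ryzen 1700X": 74.0,
    "i7-6700K": 20.0,
    "i7-5820K": 25.3,
}


def slowdown(cpu: str, baseline: str, table: dict[str, float]) -> float:
    """How many times higher the latency of `cpu` is than `baseline`."""
    return table[cpu] / table[baseline]


for rival in ("i7-5820K", "i7-6700K"):
    ratio = slowdown("Ryzen 1700X", rival, in_page_ns)
    print(f"Ryzen vs {rival}: {ratio:.1f}x the in-page latency")
```

The same helper can be applied to the full-random and sequential rows to see how much smaller the gap is there.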

Still, despite this issue we’ve seen in the previous article that Ryzen’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen’s memory performance is not the clean sweep we’ve seen in CPU testing but it is competitive with Intel’s designs, especially the older HSW (and thus BRW) cores, while the newer SKL (and thus KBL) cores sport improved caches and TLBs which are hard for Ryzen to beat. Still it’s nothing to be worried about and perhaps AMD will be able to improve things further with microcode/firmware updates if not new steppings and models (e.g. the APU model 10).

Overall we’d still recommend Ryzen over Intel CPUs unless you want an absolutely tried and tested design which has already been patched by microcode and firmware/BIOS updates. The platform has a bright future with more CPUs destined to use the AM4 socket while both the 1151 (SKL/KBL) and 2011 (HSW-E/BRW-E) platforms are due to be replaced with no future upgrades.

AMD Ryzen 1700X Review & Benchmarks – CPU 8-core Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017) replacing the previous “Vishera”/”Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later (Ryzen2) and likely include a few improvements as well.

This is the “make or break” CPU for AMD and thus greatly improves performance, including much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). Thus there are no longer “core modules” but proper “cores with 2 SMT threads” so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

No new chipsets have been introduced – thus Ryzen should work with current 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update – making it a great upgrade.

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design.

Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Ryzen 1700X
Intel 6700K (Skylake)
Intel 5820K (Haswell-E) Comments
Cores (CU) / Threads (SP) 8C / 16T 4C / 8T 6C / 12T Ryzen has the most cores and threads – so it will be down to IPC and clock speeds. But if it’s threads you want Ryzen delivers.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x)  0.8-4.0-4.2GHz (8x-40x-42x)  1.2-3.3-4.0GHz (12x-33x-40x) SKL has the highest rated speed @4GHz but all three have comparable Turbo clocks thus it depends on how long they can sustain them.
Power (TDP) 95W 91W 140W Ryzen has a comparable TDP to SKL while HSW-E is almost 50% higher.
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 6x 32kB 8-way / 6x 32kB 2-way Ryzen’s instruction cache is 2x the data cache, a somewhat strange decision; all caches are 8-way except the HSW-E’s L1I.
L2 Caches 8x 512kB 8-way 4x 256kB 8-way 6x 256kB 8-way Ryzen’s L2 is 2x as big as either Intel CPU’s, which should help quite a bit, though it is still 8-way.
L3 Caches 2x 8MB 16-way 8MB 16-way 15MB 20-way With 2x as many cores/threads as SKL, Ryzen has 2 8MB caches, one for each CCX.

Thread Scheduling and Windows

Ryzen’s topology (2 CCXes (compute clusters) of 4 cores each) makes it akin to the old Core 2 Quad or Pentium D (2 dies on 1 socket), effectively an SMP (dual CPU) system on a single socket. Windows has always tended to migrate running threads from unit to unit in order to equalise thermal dissipation though Windows 10/Server 2016 have increased the ‘stickiness’ of threads to units.

As the Windows scheduler is intertwined with the power management system, under ‘Balanced‘ and other power saving profiles unused cores are ‘parked’ (aka powered down) which affects which cores are available for scheduling. AMD has recommended the ‘High Performance‘ profile, as well as initially claiming the Windows scheduler is not ‘Ryzen-aware’ before retracting the statement.

However, there does seem to be a problem: in Sandra tests that use fewer than the total 16 threads (e.g. MC test with 8 threads) and where Sandra does not hard schedule threads using its own scheduler (e.g. .Net, Java benchmarks), the scheduling does not appear optimal:

Ryzen Hard Affinity (e.g. Native) Ryzen No Affinity (e.g. Java/.Net)

While in the left image we see Sandra at work assigning the 8 threads on the 8 different cores – with 100% utilisation on those units and almost nothing on the other 8 – on the right image we see 10 units (!) used, 4 not used at all but still 50% utilisation.

This does not seem to happen on Intel hardware – even SMP systems – thus it may be something to be adjusted in future Windows versions.
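The "hard affinity" case, where 8 threads land on 8 different cores, can be expressed as a CPU bitmask. The sketch below only computes such a mask; it assumes SMT siblings are adjacent logical CPUs (0/1, 2/3, ...), which is an assumption about enumeration order rather than something the article states, and the function names are hypothetical. On Windows a mask like this is what the Win32 affinity calls (e.g. SetThreadAffinityMask) take; how Sandra does it internally is not known.

```python
def one_thread_per_core_mask(cores: int = 8, threads_per_core: int = 2) -> int:
    """Bitmask selecting the first logical CPU of every physical core."""
    mask = 0
    for core in range(cores):
        # Assumed numbering: core N owns logical CPUs N*threads_per_core onwards
        mask |= 1 << (core * threads_per_core)
    return mask


def mask_to_cpus(mask: int) -> list[int]:
    """Expand a bitmask into the list of logical CPU indices it selects."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


mask = one_thread_per_core_mask()
print(hex(mask), mask_to_cpus(mask))
```

With such a mask each of the 8 worker threads gets a whole core to itself, matching the left-hand image, instead of two threads sharing one core while another core sits idle.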

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 290 [+24%] | 242 [+13%] AVX2 185 | 146 233 | 213 Right off the bat Ryzen beats both Intel CPUs in both MT and MC tests with SMT providing a good gain (hard scheduled of course).
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 292 [+27%] | 260 [+22%] AVX2 185 | 146 230 | 213 With a 64-bit integer workload nothing much changes, Ryzen still beats both in both tests, 27% faster than HSW-E! AMD has ri-sen from the ashes like the Phoenix!
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 185 [+23%] | 123 [+23%] AVX/FMA 109 | 74 150 | 100 Even in this floating-point test, Ryzen beats both again by a similar margin, 23% better than HSW-E. What performance for the money!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 155 [+33%] | 102 [+32%] AVX/FMA 89 | 60 116 | 77 With FP64 the winning streak continues, with the difference increasing to 33% over HSW-E a huge gain.
From integer workloads in Dhrystone to floating-point workloads in Whetstone Ryzen rules the roost, blowing both SKL and HSW-E away by being between 23-33% faster, with or without SMT. SMT does yield bigger gain than on Intel’s designs also.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 535 [-16%] | 421 [-13%] AVX2 513 | 389 639 | 485 In this vectorised AVX2 integer test Ryzen just overtakes SKL but cannot beat HSW-E and is just 16% slower; still it is a good result but it shows Intel’s SIMD units are really strong with AMD’s 8 cores matching Intel’s 4 cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 159 [-16%] | 137 [-18%] AVX2 191 | 158 191 | 168 With a 64-bit AVX2 integer vectorised workload again Ryzen is unable to beat either Intel CPU being slower by a similar margin -16%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 3.61 [+30%] | 2.1 [+11%] 2.15 | 1.36 2.74 | 1.88 This is a tough test using Long integers to emulate Int128 without SIMD and here Ryzen comes back on top being 30% faster similar to what we saw in Dhrystone.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 530 [-11%] | 424 [-4%] FMA 479 | 332 601 | 440 In this floating-point AVX/FMA vectorised test we see again the power of Intel’s SIMD units, with Ryzen being only 11% slower than HSW-E but beating SKL.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 300 [-13%] | 247 [=] FMA 271 | 189 345 | 248 Switching to FP64 SIMD code, again Ryzen cannot beat HSW-E but does beat SKL which should be sufficient.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 13.7 [+14%] | 9.7 [+2%] FMA 10.7 | 7.5 12 | 9.5 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen manages to beat both CPUs being 14% faster. So AVX2 or FMA code is not a problem.
In vectorised AVX2/FMA code we see Ryzen lose for the first time to Intel’s SIMD units but not by a large margin; in non-vectorised code as with Dhrystone and Whetstone Ryzen is again quite a bit faster than either Intel CPU. Overall Ryzen would be the preferred choice unless you are number-crunching with vectorised code.
BenchCrypt Crypto AES-256 (GB/s) 13.8 [-31%] | 14 [-32%] AES 15 | 15.4 20 | 20.7 All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and 2 memory channels is just not enough; with its 4 channels HSW-E is unbeatable for streaming tests. But Ryzen is only marginally slower than its counterpart SKL.
BenchCrypt Crypto AES-128 (GB/s) 13.9 [-31%] | 14 [-33%] AES 15 | 15.4 20.1 | 21.2 What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to beat HSW-E but at least it is only marginally slower than SKL.
BenchCrypt Crypto SHA2-256 (GB/s) 17.1 [+2.25x] | 10.6 [+49%] SHA 5.9 | 5.5 AVX2 7.6 | 7.1 AVX2 Ryzen’s secret weapon is revealed: by supporting SHA HWA it soundly beats both Intel CPUs even running multi-buffer vectorised AVX2 code – it’s 2.2x faster! Surprisingly disabling SMT (MC mode) reduces performance appreciably, not what would be expected.
BenchCrypt Crypto SHA1 (GB/s) 17.3 [+14%] | 11.4 [-14%] SHA 11.3 | 10.6 AVX2 15.1 | 13.3 AVX2 Ryzen also accelerates the soon-to-be-defunct SHA1 but the AVX2 implementation is much less complex allowing HSW-E to come within a whisker of Ryzen and beat it in MC mode by a similar amount, 14%. Still, much better to have SHA HWA than finding multiple buffers to process with AVX2.
BenchCrypt Crypto SHA2-512 (GB/s) 3.34 [-37%] | 3.32 [-36%] AVX2 4.4 | 4.2 5.34 | 5.2 SHA2-512 is not accelerated by SHA HWA (version 1) thus Ryzen has to use the same vectorised AVX2 code path where Intel’s SIMD units show their power again.
Ryzen’s secret crypto weapon is support for SHA HWA (which Intel only supports on Atom currently) which allows it to beat both Intel’s CPUs. For streaming algorithms like encrypt/decrypt it would probably benefit from more memory channels to feed all those cores. But it would still be the overall choice.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 234 [+49%] | 166 [+36%] 129 | 97 157 | 122 In this non-vectorised test we see Ryzen shine brightly again beating even HSW-E by 50%, an incredible result. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s) 198 [+51%] | 139 [+39%] 108 | 83 131 | 100 Switching to FP64 code, Ryzen still shines beating HSW-E by 50% again and totally demolishing SKL. So far so great!
BenchFinance Binomial float/FP32 (kOPT/s) 85.1 [+2.25x] | 83.2 [+3.23x] 27.2 | 18.1 37.8 | 25.7 Binomial uses thread shared data thus stresses the cache & memory system; we would expect Ryzen to falter here but nothing of the sort – it actually beats both Intel CPUs to dust – it’s 2.25 times faster than HSW-E! Even a 12-core HSW-E would not be sufficient.
BenchFinance Binomial double/FP64 (kOPT/s) 45.8 [+37%] | 46.3 [+38%] 25.5 | 24.6 33.3 | 33.5 With FP64 code the situation changes somewhat – with Ryzen only 37% faster than HSW-E; but it’s still an appreciable win. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 49.2 [+55%] | 41.2 [+52%] 25.9 | 21.9 31.6 | 27 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen reigns supreme here also being 50% faster than even HSW-E. SKL is left in the dust.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 37.3 [+75%] | 31.8 [+41%] 19.1 | 17.2 21.2 | 22.5 Switching to FP64 Ryzen increases its dominance to 75% over HSW-E while destroying SKL completely.
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads Ryzen reigns supreme beating even 6-core Haswell-E into dust by such a margin that even a 12-core HSW-E may not beat it. For financial workloads there is only one choice: Ryzen!!! Long live the new king!
BenchScience SGEMM (GFLOPS) float/FP32 68.3 [-63%] | 155 [-27%] FMA 109 | 162 185 | 213 In this tough vectorised AVX2/FMA algorithm Ryzen falters and gets soundly beaten by both SKL and HSW-E. Again the powerful SIMD units of Intel’s CPUs allow them to finally beat it as we’ve seen in previous tests. It’s its Achilles’ heel.
BenchScience DGEMM (GFLOPS) double/FP64 62.7 [-28%] | 78.4 [-23%] FMA 72 | 67.8 87.7 | 103 With FP64 vectorised code, the gap reduces with Ryzen just 28% slower than HSW-E and just a bit slower than SKL. Again vectorised SIMD code is problematic.
BenchScience SFFT (GFLOPS) float/FP32 8.9 [-50%] | 9.85 [-39%] FMA 18.9 | 15 18 | 16.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here Ryzen is again much slower than both SKL and HSW-E; for vectorised code it seems it needs 2x more SIMD units to match Intel.
BenchScience DFFT (GFLOPS) double/FP64 7.5 [-31%] | 7.3 [-30%] FMA 9.3 | 9 10.9 | 10.5 With FP64 code, Ryzen does improve (or Intel gets slower), being only 30% slower than HSW-E and about 19% slower than SKL.
BenchScience SNBODY (GFLOPS) float/FP32 234 [-15%] | 225 [-16%] FMA 273 | 271 158 | 158 N-Body simulation is vectorised but many memory accesses to shared data and here SKL seems to do unusually well beating Ryzen in 2nd place but only by 15%. Strangely HSW-E does badly in this test even with 6-cores.
BenchScience DNBODY (GFLOPS) double/FP64 87 [+10%] | 87 FMA 79 | 79 40 | 40 With FP64 code Ryzen improves beating its SKL rival by 10%; again HSW-E does pretty badly in this test.
With highly vectorised SIMD code Ryzen is again the loser but not by a lot; Intel has just one chance – highly vectorised SIMD algorithms that allow the powerful SIMD units to shine. Everything else is dominated by Ryzen.
CPU Image Processing Blur (3×3) Filter (MPix/s) 750 [-1%] | 699 [+4%] AVX2 655 | 563 760 | 668 In this vectorised integer AVX2 workload Ryzen ties with HSW-E, a good result considering we saw it lose in similar algorithms.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 316 [-8%] | 301 AVX2 285 | 258 345 | 327 Same algorithm but more shared data used sees Ryzen now 8% slower than HSW-E but still beating SKL.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 172 [-8%] | 166 AVX2 151 | 141 188 | 182 Again same algorithm but even more data shared does not change anything, Ryzen is again 8% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 292 [-7%] | 279 AVX2 271 | 242 316 | 276 Different algorithm but still AVX2 vectorised workload sees Ryzen still about 7% slower than HSW-E but again still faster than SKL.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 58.5 [+16%] | 37.4 AVX2 35.4 | 26.4 50.3 | 37 Still AVX2 vectorised code but here Ryzen manages to beat even HSW-E by 16%. Thus it is not a given it will lose in all such tests, it just depends.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.6 [+26%] | 5.2 6.3 | 4.2 7.6 | 5.5 This test is not vectorised though it uses SIMD instructions and here Ryzen manages a 26% win even over HSW-E while leaving SKL in the dust.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 852 [+50%] | 525 422 | 297 571 | 420 Again in a non-vectorised test Ryzen just flies: it’s 2x faster than SKL and no less than 50% faster than HSW-E! Intel does not have its way all the time – unless the code is highly vectorised!
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 147 [+47%] | 101 75 | 55 101 | 77 In this final non-vectorised test Ryzen really flies, it’s again 2x faster than SKL and almost 50% faster than HSW-E! Intel must be getting desperate for SIMD vectorised versions of algorithms by now…
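Each row of the table above lists a multi-threaded (MT, SMT on) and a multi-core (MC, one thread per core) score per CPU, so the benefit of SMT can be derived directly. The sketch below does this for the FP32 Whetstone row; the function name is illustrative and the figures are copied from that row.

```python
def smt_gain_percent(mt_score: float, mc_score: float) -> float:
    """Percentage gain of the MT (SMT on) score over the MC (one thread per core) score."""
    return (mt_score / mc_score - 1.0) * 100.0


# (MT, MC) scores in GFLOPS from the Native FP32 Whetstone row
whetstone_fp32 = {
    "Ryzen 1700X": (185, 123),
    "i7-6700K": (109, 74),
    "i7-5820K": (150, 100),
}

for cpu, (mt, mc) in whetstone_fp32.items():
    print(f"{cpu}: SMT gain {smt_gain_percent(mt, mc):.1f}%")
```

The gain varies considerably from test to test, so it is worth running the same calculation on other rows before drawing conclusions.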

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA) Ryzen does extremely well beating both Skylake 4C and even Haswell-E 6C in all workloads except highly vectorised SIMD code where the powerful Intel SIMD units can shine. Overall it would still be the choice for most workloads except SIMD number-crunching tasks, which are somewhat specialised.
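As an illustration of the kind of non-vectorised floating-point work where Ryzen does best, here is the textbook Black-Scholes formula for European options. This is a minimal sketch of the standard closed-form model, not Sandra's benchmark kernel; parameter names are our own.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes(spot: float, strike: float, rate: float,
                  vol: float, years: float) -> tuple[float, float]:
    """Return (call, put) prices of a European option without dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    discounted_strike = strike * exp(-rate * years)
    call = spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
print(f"call={call:.4f} put={put:.4f}")
```

Each option is priced independently with a handful of scalar log/exp/erf calls, so throughput scales with core count and scalar floating-point speed rather than SIMD width.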

While we’ve not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) more memory bandwidth to feed all 16 threads would not go amiss, but the difference may not justify the increased cost as we see with Intel’s 2011 platform and HSW-E.
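A streaming hash test of this kind can be approximated in a few lines. The sketch below measures single-threaded SHA-256 throughput using Python's standard hashlib; it is only a rough stand-in for the multi-threaded, multi-buffer Sandra test, and whether the underlying library uses SHA hardware acceleration depends on how it was built. The function name and default sizes are arbitrary.

```python
import hashlib
import time


def sha256_throughput_gbs(buffer_mb: int = 64, repeats: int = 4) -> float:
    """Hash a zero-filled buffer `repeats` times and return throughput in GB/s."""
    data = bytes(buffer_mb * 1024 * 1024)  # zero-filled test buffer
    start = time.perf_counter()
    for _ in range(repeats):
        hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    total_bytes = len(data) * repeats
    return total_bytes / elapsed / 1e9


if __name__ == "__main__":
    print(f"SHA2-256: {sha256_throughput_gbs():.2f} GB/s (single thread)")
```

Running one such loop per core on buffers larger than the caches is where the 2 memory channels start to matter.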

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. .Net 4.6.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 36.5 [+18%] | 25 23.3 | 17.2 30.7 | 26.8 .Net CLR integer performance starts off very well with 18% better performance even over HSW-E, which admittedly does not do much better than SKL.
BenchDotNetAA .Net Dhrystone Long (GIPS) 45.1 [+60%] | 26 23.6 | 21.6 28.2 | 25 Ryzen seems to greatly favour 64-bit integer workloads, here it is 60% faster than even HSW-E and almost 2x faster than SKL. All CPUs perform better with 64-bit workloads.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 100.6 [+53%] | 53 47.4 | 21.4 65.4 | 39.4 Floating-Point CLR performance is pretty spectacular with Ryzen beating HSW-E by over 50% a pretty incredible result. Native or CLR code works just great on Ryzen.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 121.3 [+41%] | 62 63.6 | 37.5 85.7 | 53.4 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen just over 40% faster than HSW-E.
It’s pretty incredible, for .Net applications Ryzen is king – no point buying Intel’s 2011 platform – buy Ryzen! With more and more applications (apps?) running under the CLR, Ryzen has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 92.6 [+22%] | 49 55.7 | 37.5 75.4 | 49.9 Just as we saw with Dhrystone, this integer workload sees a 22% improvement for Ryzen. While RyuJit supports SIMD integer vectors the lack of bitfield instructions makes it slower for our code; shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 97.8 [+23%] | 51 60.3 | 39.5 79.2 | 53.1 With 64-bit integer workload we see a similar story – Ryzen is 23% faster than even HSW-E. If only RyuJit SIMD would fix integer workloads too.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 272.7 [-4%] | 156 AVX 12.9 | 6.74 284.2 | 187.1 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Intel strikes back through its SIMD units with Ryzen 4% slower than HSW-E. Still Intel usually wins these kinds of tests.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 149 [-15%] | 85 AVX 38.7 | 21.38 176.1 | 103.3 Switching to FP64 SIMD vector code – still running AVX/FMA – Ryzen loses again, this time by 15% against HSW-E.
The only tests Intel’s CPUs can win are vectorised ones using RyuJit’s support for SIMD (aka SSE2, AVX/FMA) and thus allowing Intel’s SIMD units to shine; otherwise Ryzen dominates absolutely everything without fail.
Java Arithmetic Java Dhrystone Integer (GIPS)  513 [+29%] | 311  313 | 289  395 | 321 We start JVM integer performance with an even bigger gain, Ryzen is ~30% faster than HSW-E and 60% faster than SKL.
Java Arithmetic Java Dhrystone Long (GIPS) 514 [+28%] | 311 332 | 299 399 | 367 Nothing much changes with 64-bit integer workload, we have Ryzen 28% faster than HSW-E.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 117 [+31%] | 66 62.8 | 34.6 89 | 49 With a floating-point workload Ryzen continues its lead over both Intel’s CPUs. Native or CLR or JVM code works just great on Ryzen.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 128 [+40%] | 63 64.6 | 36 91 | 53 With FP64 workload the gap increases even further to 40% over HSW-E and an incredible 2x over SKL! Ryzen is the JVM king.
Java performance is even more incredible than what we’ve seen in .Net; server people rejoice, if you have Java workloads Ryzen is the CPU for you! 40% better performance than Intel’s 2011 platform for much lower cost? Yes please!
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 99 [+20%] | 52.6 59.5 | 36.5 82 | 49 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but here Ryzen manages a 20% lead over HSW-E and is about 66% faster than SKL.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 93 [+17%] | 51 60.6 | 37.7 79 | 53 With 64-bit vectorised workload Ryzen maintains a lead of about 17%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 86 [+40%] | 42.3 40.6 | 22.1 61 | 32 Just as we’ve seen with Whetstone, Ryzen is about 40% faster than HSW-E and over 2x faster than SKL! It does not get a lot better than this.

Intel better hope Oracle will add vector primitives allowing SIMD code to use the power of its CPU’s SIMD units.

Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 82 [+30%] | 42 40.9 | 22.1 63 | 32 With FP64 workload Ryzen’s lead somewhat inexplicably drops to ‘just’ 30% but remains about 2x faster than SKL. Nothing to grumble about really.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives Ryzen free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

Ryzen absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than on Intel’s (ex)-top-of-the-range Haswell-E – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on Ryzen. For .Net and Java code, Ryzen is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

What a reversal of fortune for AMD! Despite a hurried launch and inevitable issues which will be fixed in time (e.g. Windows scheduler), Ryzen puts in a strong performance, beating Intel’s previous top-of-the-range Skylake 6700K and Haswell-E 5820K into the dust in most tests at a much cheaper price.

Of course there are setbacks: highly vectorised AVX2/FMA code greatly favours Intel’s SIMD units and here Ryzen falls behind a bit; streaming algorithms can overload the 2 memory channels but then again Intel’s mainstream platform has only 2 also. Still if you were replacing a 2011 4-channel platform with Ryzen then very high-speed memory may be required to sustain performance.

Its dual-CCX design may also affect non-symmetrical workloads where different threads execute different code, with thread data-sharing across CCXes naturally slower. Clever thread assignment to the ‘right’ CCX should fix those issues but that is down to each application, as Windows (or other OSes) may not be able to fix it. Considering we have SMP and NUMA systems out there – it is not a new problem but perhaps one not usually seen on normal desktop systems due to the high cost of SMP/NUMA systems.

All in all Ryzen is a solid CPU which should worry Intel at the high-end; we shall have to see how the lower-end 4-core and even 2-core versions perform.