Sandra Platinum (2017) SP3 – Updates and Fixes

Update Wizard

We are pleased to release SP2 (Service Pack 3 – version 24.50) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Fix start-up crash with non-AVX older systems (both x86 and x64)
  • Tools update allowing further optimisations to AVX512 benchmarks
  • Other hardware information fixes depending on APIC ID configuration
    • AMD Ryzen 5 – L3 cache detection 4x not 2x
    • AMD Ryzen 3 – detected as SMP

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Sandra Platinum (2017) SP2 – NUMA for ThreadRipper, AVX512 for SKL-X

Update Wizard

We are pleased to release SP2 (Service Pack 2 – version 24.41) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Tools update allowing further ports of benchmarks to AVX512, e.g.:
    • CPU Multi-Media: 128-bit (octa) floating-point benchmark
    • CPU Scientific: GEMM, FFT N-Body (single and double floating-point)
    • CPU Image Processing: All filters vectorised and ported to AVX512 (Blur/Sharpen/Motion-Blur, Edge/Noise/Oil, Diffusion/Marble)
  • Algorithm harness update allowing NUMA multi-block performance improvement, e.g.:
    • CPU Multi-Media: all algorithms. both integer and floating-point.
    • CPU Cryptography: all algorithms, both crypto and hashing.
    • CPU Scientific: all algorithms (but especially (F/D)GEMM)
    • CPU Financial: Monte-Carlo (N/A others).
    • CPU Image Processing: All filters (Blur/Sharpen/Motion-Blur, Edge/Noise/Oil, Diffusion/Marble).

New articles showing the improvement that AVX512 and NUMA bring:

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

AVX512 performance improvement for SKL-X in Sandra SP2

Intel Skylake-X Core i9

What is AVX512?

AVX512 (Advanced Vector eXtensions) is the 512-bit SIMD instruction set that follows from previous 256-bit AVX2/FMA3/AVX instruction set. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators – albeit in a somewhat different form – it has finally made it to its CPU lines with Skylake-X (SKL-X/EX/EP) – for now HEDT (i9) and Server (Xeon) – and hopefully to mainstream at some point.

Note it is rumoured the current Skylake (SKL)/Kabylake (KBL) are also supposed to support it based on core changes (widening of ports to 512-bit, unit changes, etc.) – nevertheless no public way of engaging them has been found.

AVX512 consists of multiple extensions and not all CPUs (or GPGPUs) may implement them all:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit. [supported by SKL-X, Phi]
  • AVX512DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers [supported by SKL-X]
  • AVX512CD – Conflict Detection – loop vectorisation through predication [not supported by SKL-X but Phi]
  • AVX512ER – Exponential & Reciprocal – transcedental operations [not supported by SKL-X but Phi]
  • more sets will be introduced in future versions

As with anything, simply doubling register width does not automagically increase performance by 2x as dependencies, memory load/store latencies and even data characteristics limit performance gains – some of which may require future arch or even tools to realise their true potential.

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks SKL-X AVX512 SKL-X AVX2/FMA3 Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  1460 [+23%]  1180 For integer workloads we manage only 23% improvement, not quite the 100% we were hoping but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  519 [+19%]  435 With a 64-bit integer workload the improvement reduces to 19%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  7.72 [=]  7.62 No SIMD here
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1800 [+80%]  1000 In this floating-point test we finally see the power of AVX512 – it is 80% faster than AVX2/FMA3 – a huge improvement.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  1150 [+85%]  622 Switching to FP64 increases the improvement to 85% a huge gain.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  36 [+50%]  24 In this heavy algorithm using FP64 to mantissa extend FP128 we see only 50% improvement still nothing to ignore.
AVX512 cannot bring 100% improvement but does manage up to 85% improvement – a no mean feat! While integer workload is only 20-25% it is still decent. Heavy compute algorithms will greatly benefit from AVX512.
BenchCrypt Crypto SHA2-256 (GB/s) 26 [+78%]  14.6 With no data dependency – we get good scaling of almost 80% even with this integer workload.
BenchCrypt Crypto SHA1 (GB/s)  39.8 [+51%]  26.4 Here we see only 50% improvement likely due to lack of (more) memory bandwidth – it likely would scale higher.
BenchCrypt Crypto SHA2-512 (GB/s)  21.2 [+94%]  10.9 With 64-bit integer workload we see almost perfect scaling of 94%.
As we work on different buffers and have no dependencies, AVX512 brings up to 94% performance improvement – only limited by memory bandwidth with even 4 channel DDR4 @ 3200Mt/s not enough for 10C/20T CPU. AVX512 is absolutely worth it to drive the system to the limit.
BenchScience SGEMM (GFLOPS) float/FP32  558 [-7%]  605 Unfortunately the current compiler does not seem to help.
BenchScience DGEMM (GFLOPS) double/FP64  235 [+2%]  229 Changing to FP64 at least allows AVX512 to with by a meagre 2%.
BenchScience SFFT (GFLOPS) float/FP32  35.3 [=]  35.3 Again the compiler does not seem to help here.
BenchScience DFFT (GFLOPS) double/FP64  19.9 [-2%]  20.2 With FP64 nothing much happens.
BenchScience SNBODY (GFLOPS) float/FP32  585 [-1%]  591 No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64  175 [-1%]  178 With FP64 workload nothing much changes.
With complex SIMD code – not written in assembler the compiler has some ways to go and performance is not great. But at least the performance is not worse.
CPU Image Processing Blur (3×3) Filter (MPix/s)  3830 [+60%]  2390 We start well here with AVX512 60% faster with float FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  1700 [+70%]  1000 Same algorithm but more shared data improves by 70%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  885 [+56%]  566 Again same algorithm but even more data shared now brings the improvement down to 56%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  1290 [+56%]  826 Using two buffers does not change much still 56% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  136 [+59%]  85 Different algorithm keeps the AVX512 advantage the same at about 60%.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  65.6 [+31.7%]  49.8 Using the new scatter/gather in AVX512 still brings 30% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  3920 [+3%]  3800 Here we have a 64-bit integer workload algorithm with many gathers with AVX512 likely memory latency bound thus almost no improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  770 [+2%]  755 Again loads of gathers does not allow AVX512 to shine but still decent performance
As with other SIMD tests, AVX512 brings between 60-70% performance increase, very impressive. However in algorithms that involve heavy memory access (scatter/gather) we are limited by memory latency and thus we see almost no delta but at least it is not slower.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that even for a 1st-generation CPU with AVX512 support, SKL-X greatly benefits from the new instruction set – with anything between 50-95% performance improvement. However compiler/tools are raw (VC++ 2017 only added support in the recent 15.3 version) and performance sketchy where hand-crafted assembler is not used. But these will get better and future CPU generations (CFL-X, etc.) will likely improve performance.

Also let’s remember that some SKUs have 2x FMA (aka 512-bit) (and other instructions) licence – while most SKUs have only 1x FMA (aka 256-bit); the former SKUs likely benefit even more from AVX512 and it is something Intel may be more generous in enabling in future generations.

In algorithms heavily dependent on memory bandwidth or latency AVX512 cannot work miracles, but at least will extract the maximum possible compute performance from the CPU. SKUs with lower number of cores (8, 6, 4, etc.) likely to gain even more from AVX512.

In addition let’s not forget the “Phi” accelerators – that also support AVX512 – thus porting code will allow great performance on many-core (MIC) architecture too.

NUMA performance improvement for ThreadRipper in Sandra SP2

What is NUMA?

Modern CPUs have had a built-in memory controller for many years now, starting with the K8/Opteron, in order to higher better bandwidth and lower latency. As a result in SMP systems each CPU has their own memory controller and its own system memory that it can access at high speed – while to access other memory it must send requests to the other CPUs. NUMA is a way of describing such systems and allow the operating system and applications to allocate memory on the node they are running on for best performance.

As ThreadRipper is really two (2) Ryzen dies connected internally through InfinityFabric – it is basically a 2-CPU SMP system and thus a 2-node NUMA system.

While it is possible to configure it in UMA (Uniform Memory Access mode) where all memory appears to be unified and interleaved between nodes, for best performance the NUMA mode is recommended when the operating system and applications support it.

While Sandra has always supported NUMA in the standard benchmarks – some of the new benchmarks have not been updated with NUMA support especially since multi-core systems have pretty much killed SMP systems on the desktop – with only expensive severs left to bring SMP / NUMA support.

Note that all the NUMA improvements here would apply to competitor NUMA (e.g. Intel) systems, thus it is not just for ThreadRipper – with EPYC systems likely showing a far higher improvement too.

In this article we test NUMA performance; please see our other articles on:

Native Performance

We are testing native performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks NUMA 2-nodes
UMA single-node
Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  965 [+2.8%]  938 The ‘lightest’ workload should show some NUMA overhead but we can only manage 3% here.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  312 [+2.3%] 305 With a 64-bit integer workload the improvement drops to 2%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  10.9 [=]  10.9 Emulating int128 means far increased compute workload with NUMA overhead insignificant.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s)  997 [+1.2%]  985 Again no measured improvement here.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  562 [+1%]  556 Again no measured improvement here.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  27 [=]  26.85 In this heavy algorithm using FP64 to mantissa extend FP128 we see no improvement.
Fractals are compute intensive with few memory accesses – mainly to store results – thus we see a maximum of 3% improvement with NUMA support with the rest insignificant. However, this is a simple 2-node system – bigger 4/8-node systems would likely show bigger gains.
BenchCrypt Crypto AES256 (GB/s) 27.1 [+139%] 11.3 AES hardware accelerated is memory bandwidth bound thus NUMA support matters; even in this 2-node system we see a over 2x improvement of 139%!
BenchCrypt Crypto AES128 (GB/s) 27.4 [+142%] 11.3 Similar to above we see a massive 142% improvement by allocating memory on the right NUMA node.
BenchCrypt Crypto SHA2-256 (GB/s)  32.3 [+50%] 21.4 SHA is also hardware accelerated but operates on a single input buffer (with a small output hash value buffer) and here out improvement drops to 50%, still very much significant.
BenchCrypt Crypto SHA1 (GB/s) 34.2  [+56%]  21.8 Similar to above we see an even larger 56% improvement for supporting NUMA.
BenchCrypt Crypto SHA2-512 (GB/s)  6.36 [=]  6.35 SHA2-256 is not hardware accelerated (AVX2 used) but heavy compute bound thus our improvement drops to nothing.
Finally in streaming algorithms we see just how much NUMA support matters: even on this 2-note system we see over 2x improvement of 140% when working with 2 buffers (in/out). When using a single buffer our improvement drops to 50% but still very much significant. TR needs NUMA suppport to shine.
BenchScience SGEMM (GFLOPS) float/FP32  395 [114%]  184 As with crypto, GEMM benefits greatly from NUMA support with an incredible 114% improvement by allocating the (3) buffers on the right NUMA nodes.
BenchScience DGEMM (GFLOPS) double/FP64  183 [131%]  79 Changing to FP64 brings an even more incredible 131%.
BenchScience SFFT (GFLOPS) float/FP32  11.6 [86%]  6.25 FFT also shows big gains from NUMA support with 86% improvement just by allocating the buffers (2+1 const) on the right nodes.
BenchScience DFFT (GFLOPS) double/FP64  10.6 [112%]  5 With FP64 again increases
BenchScience SNBODY (GFLOPS) float/FP32  479 [=]  483 Strangely N-Body does not benefit much from NUMA support with no appreciable improvement.
BenchScience DNBODY (GFLOPS) double/FP64  189 [=]  191 With FP64 workload nothing much changes.
As with crypto, buffer heavy algorithms (GEMM, FFT, N-Body) greatly benefit from NUMA support with performance doubling (86-131%) by allocating on the right NUMA nodes; in effect TR needs NUMA in order to perform better than a standard Ryzen!
CPU Image Processing Blur (3×3) Filter (MPix/s)  2090 [+71%]  1220 Least compute brings highest benefit from NUMA support – here it is 71%.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  886 [=]  890 Same algorithm but more compute brings the improvement to nothing.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  494 [=]  495 Again same algorithm but even more compute again no benefit.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  720 [=]  719 Using two buffers does not seem to show any benefit either.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  116 [=]  117 Different algorithm keeps with more compute means no benefit either.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  40.3 [=]  40.7 Using the new scatter/gather in AVX2 does not help matters even with NUMA support.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1880 [+90%]  982 Here we have a 64-bit integer workload algorithm with many gathers not compute heavy brings 90% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  397 [=]  396 Heavy compute brings down the improvement to nothing.
As with other SIMD tests,  low compute algorithms see 70-90% improvement from NUMA support; heavy compute algorithms bring the improvement down to zero. It all depends whether the overhead of accessing other nodes can be masked by compute; in effect TR seems to perform pretty well.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that ThreadRipper needs NUMA support in applications – just like any other SMP system today to shine: we see over 2x improvement in bandwidth-heavy algorithms. However, in compute-heavy algorithms TR is able to mask the overhead pretty well – with NUMA bringing almost no improvement. For non NUMA supporting software the UMA mode should be employed.

Let’s remember we are only testing a 2-node system, here, a 4+ node system is likely to show higher improvements and with EPYC systems stating at 1-socket 4-node we can potentially have common 4-socket 16-node systems that absolutely need NUMA for best performance. We look forward to testing such a system as soon as possible 😉

 

Sandra Platinum (2017) SP1a

Update Wizard

We are pleased to release SP1a (Service Pack 1a – version 24.30) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Tools update allowing further ports of benchmarks to AVX512, e.g.:
    • CPU Multi-Media: 128-bit (octa) floating-point benchmark
    • CPU Scientific: GEMM and N-Body (single and double floating-point)
    • CPU Image Processing: SIMD filters (AVX2/FMA3, AVX, SSE4*, SSE2) performance improvement
    • further benchmarks will be enabled as tools are further updated

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Sandra Platinum (2017) SP1

We are pleased to announce SP1 (Service Pack 1 – version 24.27) for Sandra Platinum (2017) with updated hardware support and benchmark optimisations:

Sandra Platinum (2017) Press Release

  • Updated hardware support to including:
    • Intel HEDT/Workstation/Server Skylake-X/Kabylake-X
    • Intel Core Cofeelake
    • AMD HEDT/Workstation/Server Threadripper
    • Updated DDR4, NVMem (non-volatile), PMem (persistent) memory support
  • Updated CPU benchmarks including:
    • Updated AVX512 benchmarks (Multi-Media, Cryptography/Hashing, Memory & Cache Bandwidth)
    • Further benchmarks will be updated to AVX512 in due course
  • Benchmark Fixes
    • Fix: CPU Power Management Efficiency benchmark running with more than 16 threads.
    • Fix: SGEMM AVX2/FMA running with non-power of 2 threads
    • Fix: Database AVX512 scores would not be entered

New hardware reviews with Sandra Platinum (2017) SP1:

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra Platinum updated to RTMa

We are updating Sandra Platinum (2017) to RTMa (version 24.18) with a few important fixes and optimizations. Please update to this version as soon as applicable.

Sandra 2017 Press Release

* Windows 7/Server 2008 R2: failure to launch due to missing TPM support.

If the BitLocker/TPM2 update was not installed and TPM support were missing Sandra would fail to launch.

* AMD Ryzen: updated support and fixes.

Better detection and information through further testing.

* Intel Atom Braswell and later: multiplier detection issues.

Benchmarks would fail to validate due to incorrect values.

– CPU Scientific: crash using the SSE2/3 FFT implementation on Atom CPUs.

* GPGPU: optimisations and fixes.

– OpenCL: optimisations for image processing, especially relating to FP16/half processing:

FP16 GPGPU Image Processing Performance & Quality

– DX ComputeShader: fixes for image processing enabling both FP32 and FP16 performance.

* GPGPU: test data randomised.

– Cryptography: CUDA, OpenCL, DX CS: replaced with random data which reduces cache locality and thus performance especially on APUs. AES performance has thus reduced by up to 50-70%.

Using randomised/high-entropy data is meant to improve the “fairness” of the benchmarks preventing “best-case” scenarios where results may be better than expected. While hardware may thus perform better given low-entropy data – that is preferable to the previous results.

* Benchmark Results Ranker: old results (pre Sandra 2014) removed.

– The reference/aggregated results have been updated with newer 2017 results given higher weight.

Download Sandra Lite

Sandra USB supplied SanDisk Extreme disks

We are now providing Sandra USB versions on one of the fastest USB3 flash disks – the SanDisk Extreme! With read bandwidth ~200MB/s you won’t be waiting for Sandra to start off the USB drive (write speed is not bad either at ~60MB/s).

Now they are not the smallest of flash drives but then again you are unlikely to lose them. Here they are for your viewing pleasure:

Sandra USB on SanDisk Extreme

Sandra USB on SanDisk Extreme

SP1a for SiSoftware Sandra 2016 Released!

Update Wizard

We are happy to release SP1a (Service Pack 1a) to SiSoftware Sandra 2016.

This is a minor update that improves stability and adds a few optimisations that were developed after further testing of SP1 release.

The SP1a update also enables the Marbling: Perlin Noise 2D (3 octaves) Filter for both GPGPUs (CUDA, OpenCL) and CPU.

Sandra 2016 SP1 New Image Filters

SP1 for SiSoftware Sandra 2016 Released!

Update Wizard

We are happy to release SP1 (Service Pack 1) to SiSoftware Sandra 2016.

This release introduces initial AVX512 benchmarks with all SIMD benchmarks due to be ported once compiler support becomes available:

CPU Multi-Media (Fractal Generation): single, double floating-point; integer, long benchmarks ported to AVX512. [See article Future performance with AVX512]

CPU Crypto (SHA Hashing): SHA2-256 and SHA2-512 multi-buffer ported to AVX512.

– Hardware support for future arch (AMD and Intel).

.Net Multi-Media native vector support is vector width independent and thus will support AVX512 with a future CLR release automatically

GPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: CUDA, OpenCL
  • Diffusion: Randomise (256) Filter: CUDA, OpenCL
  • Marbling: Perlin Noise 2D (3 octaves) Filter: CUDA, OpenCL

CPU Image Processing: New, more complex filters

  • Oil Painting: Quantise (9×9) Filter: AVX2/FMA, AVX, SSE2
  • Diffusion: Randomise (256) Filter: AVX2/FMA, AVX, SSE2
  • Marbling: Perlin Noise 2D (3 octaves) Filter: AVX2/FMA, AVX, SSE2

Sandra 2016 SP1 New Image FiltersMore benchmarks will be ported to AVX512 subject to compiler support; currently Microsoft’s VC++ does not support AVX512 intrinsics and in the interest of fairness we do not use specialised compilers.

Please see our article – Future performance with AVX512 – for a primer on AVX512 and projected performance improvements due to AVX512 and 512-bit transfers.