
Memory Configuration and Performance on Scalable Processors


November 9, 2019 by Adam Jundt

We often see requests for a range of memory configurations, from 2 to 12 DIMMs per socket. Some of our customers would like to have 128/256/512/1024GB of RAM for the latest generation of Intel Xeon Scalable processors (Skylake and Cascade Lake). The Xeon Scalable processors support six memory channels per socket and DIMM speeds up to DDR4-2933. To take advantage of all memory channels in a dual-socket system, every one of the 12 channels needs a DIMM, which works out to 96/192/384/768/1536GB of RAM (for example, 12x 32GB DIMMs = 384GB).

If instead we were to populate a dual-socket system with 256GB of memory via 8x 32GB DIMMs, we would only be using 4 of the 6 memory channels per socket. This configuration still works, but memory bandwidth suffers. To find out by how much, we ran a simple memory benchmark, STREAM, in multiple memory configurations on a couple of motherboards and present our findings below.

About the Benchmark

The STREAM benchmark was used to test memory performance. STREAM is a simple synthetic benchmark that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. It was created and is maintained by John McCalpin and is hosted at the University of Virginia [1].
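For context on what those kernels look like, below is a minimal sketch of STREAM's four loops (Copy, Scale, Add, and Triad). It is an illustration rather than the NERSC source: the real stream.c uses statically allocated arrays, repeats each kernel NTIMES times, times each one individually, and reports the best bandwidth per kernel using the bytes-moved counts noted in the comments.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 80000000L   /* same array length as the -DN=80000000 compile option below */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    const double scalar = 3.0;

    /* First-touch initialization so pages land on the threads' local NUMA nodes. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++) c[i] = a[i];                 /* Copy:  2 arrays = 16N bytes */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) b[i] = scalar * c[i];        /* Scale: 2 arrays = 16N bytes */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) c[i] = a[i] + b[i];          /* Add:   3 arrays = 24N bytes */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; /* Triad: 3 arrays = 24N bytes */
    t = omp_get_wtime() - t;

    printf("four kernels, one pass each: %.3f s\n", t);
    free(a); free(b); free(c);
    return 0;
}

Because the arrays are far larger than cache, every iteration has to go out to DRAM, which is why the reported MB/s closely tracks how many memory channels are populated and busy.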

I used the slightly modified version of STREAM from NERSC's Trinity benchmarks. No other changes were made to the source. CentOS 7 was installed with its default configuration. From there, the benchmark was compiled via:

gcc -fopenmp -O2 -fpic -mcmodel=large -D_OPENMP -DNTIMES=100 -DN=80000000 stream.c -o stream.exe

Here NTIMES is the number of times to run each test and N is the array length. There are 3 arrays, so this sets each array to roughly 640MB (80,000,000 x 8-byte doubles), well above the L3 cache size.
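As a quick check on that footprint, this tiny standalone program (just an illustration, not part of stream.c) prints the sizes implied by -DN=80000000:

#include <stdio.h>

#define N       80000000L   /* array length from the compile line above */
#define NARRAYS 3           /* stream.c uses three arrays: a, b, and c */

int main(void) {
    long per_array = N * (long)sizeof(double);   /* 80,000,000 * 8 = 640,000,000 bytes */
    printf("per array: %ld MB, total working set: %ld MB\n",
           per_array / 1000000, (NARRAYS * per_array) / 1000000);
    return 0;
}

It reports about 640MB per array and roughly 1.9GB in total, so the kernels cannot be serviced from cache and the measurement reflects DRAM bandwidth.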

Cascade Lake Setup

The system used was a Tyan Thunder HX FT77DB7109. The motherboard supports dual Intel Cascade Lake processors and up to 24 DIMMs. The system was configured with 2x Xeon Gold 6240 processors (18 cores, 2.60GHz) and 32GB DDR4-2933 ECC DIMMs.

Results – Cascade Lake

The total DIMM count was varied from 4 DIMMs (2 per socket) to 8 and then 12. Benchmarks were run with one OpenMP thread per physical core (36 threads across the two 18-core CPUs) via:

export OMP_NUM_THREADS=36 && ./stream.exe
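Bandwidth numbers like these also depend on the 36 threads actually spreading across both sockets so that every populated channel sees traffic. The short utility below is one quick way to check where the threads land before a run; it is a side sketch for illustration (it assumes Linux/glibc for sched_getcpu) and is not part of STREAM or of the runs reported here.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

/* Report which logical CPU each OpenMP thread is running on, to confirm
   the threads are spread over both sockets before launching STREAM. */
int main(void) {
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %2d of %2d on CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Compile it with gcc -fopenmp and run it with the same OMP_NUM_THREADS=36. If the threads bunch up on one socket, OpenMP placement variables such as OMP_PROC_BIND=spread and OMP_PLACES=cores can pin them across the cores; the runs above set only OMP_NUM_THREADS.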

[Figure: memory bandwidth graph]

Figure 1: Memory bandwidth performance on Cascade Lake processors as the number of DIMMs in the system is varied. Higher is better.

From the results we can see that not populating all memory channels can result in up to a ~65% memory bandwidth loss.

Conclusion

We strongly recommend that customers fully populate all six memory channels to get the best performance out of their systems. With falling memory prices (already down ~50% from last year), adding the extra DIMMs will hopefully fit within the budget for your servers. I'll be running another test soon to show how real application performance is affected by memory population and will post the results on our website.

Contact Us

If you’d like to discuss configuring your next Xeon Scalable server for optimal memory performance, please reach out to me at adam.jundt@advancedhpc.com.

References

[1] McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.

Adam Jundt

Adam Jundt has worked in High Performance Computing for over 10 years. He started as a user in grad school (MS in Computer Science with a focus on HPC) and went on to work as an MPI developer at Cray, in performance support at CCT@LSU and SDSC@UCSD for XSEDE researchers, as a developer at EP Analytics, and now as a Senior Sales Engineer at Advanced HPC. He has published and presented HPC research at local and international conferences and co-authored proposals that resulted in funding from DOE, DOD, NASA, and NSF. His work has focused on extracting the best energy efficiency and performance out of HPC systems and on finding matches between application needs and available hardware.