Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.
1711 Discussions

Can someone explain why NUMA is not always best for multi-socket systems?

hardware_guy
Beginner
8,949 Views

I have a dual-socket Westmere server running two six-core Xeon X5680 CPUs with NUMA enabled.

I have been reading about the benefits of NUMA when software is designed with NUMA in mind, but I've also been reading that NUMA can hurt performance if the software does not specifically take it into account.  How is this possible?  My understanding was that disabling NUMA brings memory access from both CPUs down to the lowest common denominator of remote memory.  For example, with NUMA enabled, local memory access might take around 100 nanoseconds and remote memory access to the other node around 200 nanoseconds; with NUMA disabled, every memory access would take 200 nanoseconds.  Am I thinking about this the right way?  Can someone explain some cases where NUMA with non-NUMA-aware software might not work well?
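For illustration, a rough way to see the local/remote difference on a two-node box like this would be something along these lines (a sketch only, using libnuma; it assumes nodes numbered 0 and 1, links with -lnuma, and is not a rigorous benchmark):

/* Rough local-vs-remote memory probe (sketch only; assumes libnuma, Linux,
 * and two NUMA nodes numbered 0 and 1).
 * Build: gcc -O2 numa_probe.c -lnuma -o numa_probe */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* much larger than the caches */

/* Touch one byte per cache line across the whole buffer; return elapsed ms. */
static double stream_ms(volatile char *buf)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < BUF_SIZE; i += 64)
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available (node interleaving enabled?)\n");
        return 1;
    }
    numa_run_on_node(0);                 /* keep this thread on node 0 */
    for (int node = 0; node <= 1; node++) {
        char *buf = numa_alloc_onnode(BUF_SIZE, node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        stream_ms(buf);                  /* warm-up: fault pages onto 'node' */
        printf("memory on node %d: %.1f ms per sweep\n", node, stream_ms(buf));
        numa_free(buf, BUF_SIZE);
    }
    return 0;
}

Running this pinned to node 0 should show the node-1 sweep taking noticeably longer than the node-0 sweep when NUMA is enabled.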

0 Kudos
21 Replies
TimP
Honored Contributor III
498 Views

In the context of the types of systems usually discussed here, a NUMA system is one which has memory banks associated with processor packages, and access by one CPU to memory associated with another package requires additional steps.  Many early dual-CPU x86_64 systems were delivered in the non-NUMA BIOS configuration mentioned earlier in this thread, in which alternate cache lines are local and remote.  A non-NUMA-aware application then performs about the same on either CPU, even if suspended and resumed, but cannot reach full memory performance.

Enabling NUMA mode in the BIOS then requires setting affinity for the application in order to maximize performance by having each thread use primarily local memory.  For example, the OpenMP environment variables OMP_PLACES and OMP_PROC_BIND, or the Intel-specific KMP_HW_SUBSET, are useful here.  Programming models such as Cilk(tm) Plus may preclude such optimization.
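To make this concrete, here is a minimal first-touch sketch (my own illustration, not taken from any Intel example): with threads pinned, e.g. OMP_PLACES=cores and OMP_PROC_BIND=spread, each thread initializes the part of the array it will later work on, so those pages land in that thread's local memory.

/* First-touch sketch: with NUMA enabled and threads pinned, e.g.
 *   OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
 * Linux places each page on the node of the thread that first writes it.
 * Build: gcc -O2 -fopenmp first_touch.c (or icc -qopenmp). Illustration only. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (64L * 1024 * 1024)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double sum = 0.0;

    /* Parallel initialization: the thread that will later use a[i] touches
     * it first, so its pages are allocated on that thread's local node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    /* Same static schedule => same iteration-to-thread mapping, so the
     * compute loop reads mostly local memory. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g (max threads = %d)\n", sum, omp_get_max_threads());
    free(a);
    return 0;
}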

The NUMA term also comes into play for CPUs on which groups of hardware threads share a cache, which likewise requires thread affinity for full performance.  For example, on MIC KNL, each level-2 cache tile has 8 hardware threads, split between 2 cores.
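A quick way to check where threads actually land (a throwaway sketch, nothing KNL-specific; sched_getcpu() is Linux/glibc-specific) is to print each OpenMP thread's logical CPU under different OMP_PLACES / OMP_PROC_BIND settings:

/* Print where each OpenMP thread runs; try e.g.
 *   OMP_PLACES=threads OMP_PROC_BIND=close ./a.out
 * versus OMP_PROC_BIND=spread and compare the placement.
 * Build: gcc -fopenmp where_am_i.c. Illustration only. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread reports the logical CPU it is currently running on. */
        printf("OpenMP thread %2d on logical CPU %3d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}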

Reply