Brice Goglin's Blog

Nov. 5th, 2009

18:57 - Fun with SuperMicro BIOS and PCI-NUMA

We have a SuperMicro machine with an X8DAH motherboard at work. It contains 2 Intel Xeon Nehalem X5550 processors (8 cores, 16 threads total) and 3 GPUs. As on several Nehalem motherboards, there are actually 2 IO hubs, one near each socket.

  ---------   ------------   ------------   ---------
  | Mem#0 |===| Socket#0 |===| Socket#1 |===| Mem#1 |
  ---------   ------------   ------------   ---------
                   ||             ||
               -----------   -----------
               | IOHub#0 |===| IOHub#1 |
               -----------   -----------
                   ||             ||
                 GPU#0         GPU#1+2

PCI devices behind one IO hub are therefore closer to one socket than to the other, and DMA performance depends on where the target memory is located: in the memory near that socket, or in the other memory node. The motherboard manual tells us which PCI slots are behind which IO hub (and thus near which socket/memory), and benchmarking our GPUs confirms the position of each PCI device in the above picture. But we want to find out such information automatically to ease deployment and portability of applications. Linux may report such information through sysfs:

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
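
The same lookup can be scripted. Below is a minimal Python sketch, assuming a Linux sysfs layout like the one used above. The function name `pci_locality` and the `sysfs_root` parameter are made up for illustration, and the `numa_node` attribute is not shown in this post: it is a second per-device sysfs file that recent kernels expose next to `local_cpulist`.

```python
from pathlib import Path


def pci_locality(bdf, sysfs_root="/sys"):
    """Return (local_cpulist, numa_node) for one PCI device.

    bdf is the full PCI address, e.g. "0000:02:00.0".
    Each element is the raw sysfs string, or None if the file is missing.
    """
    dev = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf

    def read_attr(name):
        # Missing device or missing attribute: report None instead of failing.
        try:
            return (dev / name).read_text().strip()
        except OSError:
            return None

    return read_attr("local_cpulist"), read_attr("numa_node")
```

Calling `pci_locality("0000:84:00.0")` on the machine above should return the same CPU list string as the `cat` command, as the first element of the tuple.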

However, this is wrong: 0,2,4,6,8,10,12,14 means near socket #0, while 2 of the GPUs are actually near socket #1 (CPUs 1,3,5,7,9,11,13,15). This could have been a bug in the Linux kernel, but it's actually a bug in the BIOS (Linux just reports what the BIOS tells it). So we talked to SuperMicro about it and tried upgrading the BIOS.
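
The reasoning above (a CPU list tells you which socket a device is reported near) can be written down as a small check. This is only a sketch: `sockets_of` and `OLD_BIOS_CPU_TO_SOCKET` are hypothetical names, and the CPU-to-socket table is hard-coded from the numbering described in this post (even CPUs on socket #0, odd CPUs on socket #1). On a live machine the table could instead be built from `/sys/devices/system/cpu/cpuN/topology/physical_package_id`.

```python
def sockets_of(cpus, cpu_to_socket):
    """Return the set of socket ids that the given CPUs belong to."""
    return {cpu_to_socket[cpu] for cpu in cpus}


# CPU numbering with the old BIOS, as described in the post (assumption:
# 16 logical CPUs, even ones on socket #0, odd ones on socket #1).
OLD_BIOS_CPU_TO_SOCKET = {cpu: cpu % 2 for cpu in range(16)}

# What sysfs reported for all three GPUs.
reported = [0, 2, 4, 6, 8, 10, 12, 14]
# What it should have reported for the two GPUs behind IOHub#1.
expected = [1, 3, 5, 7, 9, 11, 13, 15]

reported_sockets = sockets_of(reported, OLD_BIOS_CPU_TO_SOCKET)
expected_sockets = sockets_of(expected, OLD_BIOS_CPU_TO_SOCKET)
```

Here `reported_sockets` contains only socket 0 and `expected_sockets` only socket 1, which is the mismatch that pointed to the BIOS.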


The first BIOS upgrade (from 1.0 to 1.0b) went kind of bad: the machine didn't boot anymore at all, not even any BIOS message on screen. Fortunately, we removed the GPUs and it booted again. But Linux didn't have any NUMA information at all. It was just saying there was a single NUMA node instead of 2. So we just forgot about all this mess and downgraded back to the older BIOS.
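
A quick way to catch this "single NUMA node" situation after a BIOS change is to count the node directories that Linux exposes. The sketch below assumes the usual `/sys/devices/system/node/nodeN` layout; the function name `numa_node_ids` and the `sysfs_root` parameter are made up for illustration.

```python
import re
from pathlib import Path


def numa_node_ids(sysfs_root="/sys"):
    """Return the sorted list of NUMA node ids visible in sysfs."""
    node_dir = Path(sysfs_root) / "devices" / "system" / "node"
    if not node_dir.is_dir():
        # Kernel built without NUMA support, or sysfs not mounted.
        return []
    ids = []
    for entry in node_dir.iterdir():
        # Keep only directories named node0, node1, ...
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)
```

On a dual-socket machine like this one, a result with fewer than 2 entries would suggest that NUMA got disabled in the BIOS.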

Another BIOS update came out recently (1.0c), so I contacted SuperMicro to ask whether it was worth upgrading. At some point, they asked me to try disabling NUMA in the current BIOS. The machine didn't boot anymore... except after removing some GPUs. Exactly as above. It seems that there is an incompatibility between disabling NUMA in the BIOS and having multiple GPUs in the machine. And the first BIOS upgrade apparently disabled NUMA by default, causing all the above problems with BIOS 1.0b.


So we had to try upgrading again, and make sure NUMA wasn't left disabled by default again. Instead of going back to 1.0b, I upgraded the BIOS to the latest release (1.0c) directly. And now the machine finally reports the right PCI-NUMA information!

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0-3,8-11
  4-7,12-15
  4-7,12-15
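
The CPU lists above mix single numbers and ranges, so comparing them by hand gets tedious. Here is a minimal parsing sketch; `parse_cpulist` is a hypothetical helper, and it only handles the comma-separated numbers and `first-last` ranges that appear in the outputs shown in this post.

```python
def parse_cpulist(text):
    """Expand a list such as "0-3,8-11" into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            # Empty input (device with no local CPUs reported).
            continue
        first, sep, last = part.partition("-")
        if sep:
            # Range "first-last", both ends included.
            cpus.update(range(int(first), int(last) + 1))
        else:
            # Single CPU number.
            cpus.add(int(first))
    return sorted(cpus)
```

With this, `parse_cpulist("0-3,8-11")` and `parse_cpulist("4-7,12-15")` give two disjoint sets of 8 CPUs each, one per socket, which is what we expect for GPU#0 on one side and GPU#1+2 on the other.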

You might have noticed that CPU numbering changed in the meantime (CPU number interleaving is different), but I don't care since we have hwloc (Hardware Locality) to deal with it. Now the development version of our lstopo tool reports the whole machine topology, including PCI, as expected.


In short, if you have an X8DAH motherboard, don't disable NUMA in the BIOS (why would you do that anyway?) since it causes boot failures in some cases (when 3 GPUs are connected here), and upgrade to 1.0c if you care about memory/PCI locality/performance (which is probably the case anyway).
