
Brice Goglin's Blog - Fun with SuperMicro BIOS and PCI-NUMA

Nov. 5th, 2009

18:57 - Fun with SuperMicro BIOS and PCI-NUMA


We have a SuperMicro machine with an X8DAH motherboard at work. It contains 2 Intel Xeon Nehalem X5550 processors (8 cores, 16 threads total) and 3 GPUs. As on several Nehalem motherboards, there are actually 2 IO hubs, one near each socket.

  ---------   ------------   ------------   ---------
  | Mem#0 |===| Socket#0 |===| Socket#1 |===| Mem#1 |
  ---------   ------------   ------------   ---------
                   ||             ||
               -----------   -----------
               | IOHub#0 |===| IOHub#1 |
               -----------   -----------
                   ||             ||
                 GPU#0         GPU#1+2

So PCI devices behind one IO hub are closer to one socket than to the other, and DMA performance depends on where the target memory is located: in the memory near that socket, or in the other memory node. The motherboard manual tells us which PCI slots are behind which IO hub (and thus near which socket/memory), and benchmarking our GPUs confirms the position of each PCI device in the above picture. But we want to find out such information automatically to ease deployment and portability of applications. Linux may report such information through sysfs:

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14

However, this is wrong: 0,2,4,6,8,10,12,14 means near socket #0, while 2 of the GPUs are actually near socket #1 (CPUs 1,3,5,7,9,11,13,15). This could have been a bug in the Linux kernel, but it's actually a bug in the BIOS (Linux just reports what the BIOS tells it). So we talked to SuperMicro about it and tried upgrading the BIOS.
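For reference, local_cpulist is a comma-separated list of CPU numbers and inclusive ranges. Here is a minimal Python sketch of reading it for a given device. The helper names parse_cpulist and local_cpus are made up for illustration, and the sketch assumes the file only contains plain numbers and ranges such as "8-11".

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty string yields no CPUs
        first, _, last = chunk.partition("-")
        # A chunk is either a single CPU ("6") or an inclusive range ("8-11").
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def local_cpus(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the CPUs the kernel reports as local to a PCI device."""
    path = Path(sysfs_root) / pci_address / "local_cpulist"
    return parse_cpulist(path.read_text())


if __name__ == "__main__":
    # PCI addresses of the three GPUs in this machine; adjust for another one.
    for address in ("0000:02:00.0", "0000:84:00.0", "0000:85:00.0"):
        try:
            print(address, sorted(local_cpus(address)))
        except OSError:
            print(address, "not present on this machine")
```

Keep in mind that this only returns what the kernel exposes, so with a BIOS that reports wrong locality the result is just as wrong as the cat output above.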


The first BIOS upgrade (from 1.0 to 1.0b) went kind of bad: the machine didn't boot anymore at all, with not even a BIOS message on screen. Fortunately, it booted again once we removed the GPUs. But Linux then had no NUMA information at all: it reported a single NUMA node instead of 2. So we just forgot about all this mess and downgraded back to the older BIOS.

Another BIOS update came out recently (1.0c) so I contacted SuperMicro to know if it was worth upgrading. At some point, they asked me to try disabling NUMA in the current BIOS. The machine didn't boot anymore... except after removing some GPUs. Exactly as above. It seems that there is an incompatibility between disabling NUMA in the BIOS and having multiple GPUs in the machine. And the first BIOS upgrade apparently disabled NUMA by default, causing all the above problems with BIOS 1.0b.


So we had to try upgrading again, and make sure NUMA wasn't left disabled by default again. Instead of going back to 1.0b, I upgraded the BIOS to the latest release (1.0c) directly. And now the machine finally reports the right PCI-NUMA information!

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0-3,8-11
  4-7,12-15
  4-7,12-15
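With values like these, devices can be grouped by locality without opening the motherboard manual. A minimal Python sketch, using the three values above as literal input (group_by_locality and parse_cpulist are hypothetical helper names, and the parser assumes only plain CPU numbers and inclusive ranges):

```python
from collections import defaultdict


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


# local_cpulist values reported with BIOS 1.0c, copied from the output above.
REPORTED = {
    "0000:02:00.0": "0-3,8-11",
    "0000:84:00.0": "4-7,12-15",
    "0000:85:00.0": "4-7,12-15",
}


def group_by_locality(reported):
    """Group PCI addresses by the set of CPUs reported as local to them."""
    groups = defaultdict(list)
    for address, cpulist in reported.items():
        groups[frozenset(parse_cpulist(cpulist))].append(address)
    return dict(groups)


if __name__ == "__main__":
    for cpus, addresses in group_by_locality(REPORTED).items():
        print(sorted(cpus), "->", ", ".join(addresses))
```

With the 1.0c values this gives two groups, 02:00.0 alone and 84:00.0 together with 85:00.0, which matches the picture. With the values reported by BIOS 1.0, all three devices would fall into a single group.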

You might have noticed that CPU numbering changed in the meantime (CPU number interleaving is different), but I don't care since we have hwloc (Hardware Locality) to deal with it. Now the development version of our lstopo tool reports the whole machine topology, including PCI, as expected.


In short, if you have an X8DAH motherboard, don't disable NUMA in the BIOS (why would you do that anyway?) since it causes boot failures in some cases (here, when 3 GPUs are connected), and upgrade to BIOS 1.0c if you care about memory/PCI locality and performance (which is probably the case anyway).
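Once local_cpulist is right, an application can use it to run near a device. The following Python sketch is only one possible way to do that (bind_near is a hypothetical helper, not part of hwloc): it restricts the calling process to the CPUs listed as local to a device, using os.sched_setaffinity, which is Linux-only. It assumes that memory allocated afterwards ends up near those CPUs, which is the usual default policy on Linux but is not guaranteed.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def bind_near(cpulist_text, pid=0):
    """Restrict a process to the listed CPUs that it is already allowed to use.

    Returns the set of CPUs applied. If none of the listed CPUs is usable,
    the affinity is left unchanged and an empty set is returned.
    """
    allowed = os.sched_getaffinity(pid)
    target = parse_cpulist(cpulist_text) & allowed
    if target:
        os.sched_setaffinity(pid, target)
    return target


if __name__ == "__main__":
    # Value reported for 84:00.0 and 85:00.0 with BIOS 1.0c; in practice it
    # would be read from the device's local_cpulist file instead.
    applied = bind_near("4-7,12-15")
    print("now bound to CPUs:", sorted(applied) or "unchanged")
```

Intersecting with the current affinity keeps the sketch from failing on machines with fewer CPUs or inside a restricted cpuset.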


Comments:

From: Kenneth Lloyd
Date: February 13th, 2012 01:16 (UTC)

Thanks for posting!

Wish I'd realized this was a BIOS reporting issue earlier. Have you tried constructing MPI 2.2 Dist_graphs with this info - esp. for CUDA RDMA with multiple GPUs?
From: bgoglin
Date: February 13th, 2012 05:45 (UTC)

Re: Thanks for posting!

I am not very familiar with what you would do with such a dist_graph. There are so many ideas related to hwloc in OpenMPI that it's hard to follow all of them. At least, IIRC, some people from Bull use hwloc to provide I/O device (mostly IB NIC) distances inside Carto. Doing the same for GPUs would be very easy.