
Brice Goglin's Blog

Nov. 5th, 2009

18:57 - Fun with SuperMicro BIOS and PCI-NUMA

We have a SuperMicro machine with a X8DAH motherboard at work. It contains 2 Intel Xeon Nehalem X5550 processors (8 cores, 16 threads total) and 3 GPUs. Like several Nehalem motherboards, it actually has 2 IO hubs, one near each socket.

  ---------   ------------   ------------   ---------
  | Mem#0 |===| Socket#0 |===| Socket#1 |===| Mem#1 |
  ---------   ------------   ------------   ---------
                   ||             ||
               -----------   -----------
               | IOHub#0 |===| IOHub#1 |
               -----------   -----------
                   ||             ||
                 GPU#0         GPU#1+2

So PCI devices behind one IO hub are closer to one socket than to the other, and DMA performance depends on where the target memory is located: in the memory node near that socket, or in the other one. The motherboard manual tells us which PCI slots are actually behind which IO hub (and thus near which socket/memory), and benchmarking our GPUs confirms the actual position of each PCI device in the above picture. But we want to find out such information automatically, to ease deployment and portability of applications. Linux may report such information through sysfs:

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14

However, this is wrong since 0,2,4,6,8,10,12,14 means near socket #0, while 2 of the GPUs are actually near socket #1 (CPUs 1,3,5,7,9,11,13,15). This could have been a bug in the Linux kernel, but it's actually a bug in the BIOS (Linux just reports what the BIOS tells it). So we talked to SuperMicro about it and tried upgrading the BIOS.
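As an aside, the "find it out automatically" part is easy once the BIOS reports correct information: an application can read the same sysfs attribute directly. Here is a minimal C sketch; the helper name is made up and the bus IDs are the three GPUs from the example above.

  /* Minimal sketch: read the local_cpulist attribute of a PCI device to
   * find out which CPUs are close to it. print_local_cpulist() is a
   * made-up helper name; the bus IDs are the GPUs from the example above. */
  #include <stdio.h>

  static int print_local_cpulist(const char *busid)
  {
      char path[128], cpulist[256];
      FILE *f;

      snprintf(path, sizeof(path),
               "/sys/bus/pci/devices/%s/local_cpulist", busid);
      f = fopen(path, "r");
      if (!f) {
          perror(path);
          return -1;
      }
      if (fgets(cpulist, sizeof(cpulist), f))
          printf("%s is close to CPUs %s", busid, cpulist);
      fclose(f);
      return 0;
  }

  int main(void)
  {
      print_local_cpulist("0000:02:00.0");
      print_local_cpulist("0000:84:00.0");
      print_local_cpulist("0000:85:00.0");
      return 0;
  }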


The first BIOS upgrade (from 1.0 to 1.0b) went rather badly: the machine didn't boot anymore at all, not even showing a BIOS message on screen. Fortunately, it booted again once we removed the GPUs. But Linux then didn't have any NUMA information at all: it reported a single NUMA node instead of 2. So we just forgot about all this mess and downgraded back to the older BIOS.

Another BIOS update came out recently (1.0c), so I contacted SuperMicro to ask whether it was worth upgrading. At some point, they asked me to try disabling NUMA in the current BIOS. The machine didn't boot anymore... except after removing some GPUs, exactly as above. It seems that there is an incompatibility between disabling NUMA in the BIOS and having multiple GPUs in the machine. And the first BIOS upgrade apparently disabled NUMA by default, which explains all the above problems with BIOS 1.0b.


So we had to try upgrading again, while making sure NUMA wasn't left disabled by default this time. Instead of going back to 1.0b, I upgraded the BIOS to the latest release (1.0c) directly. And now the machine finally reports the right PCI-NUMA information!

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0-3,8-11
  4-7,12-15
  4-7,12-15

You might have noticed that the CPU numbering changed in the meantime (the CPU numbers are interleaved differently), but I don't care since we have hwloc (Hardware Locality) to deal with it. The development version of our lstopo tool now reports the whole machine topology, including PCI, as expected.


In short, if you have an X8DAH motherboard, don't disable NUMA in the BIOS (why would you do that anyway?) since it causes boot failures in some cases (here, with 3 GPUs connected), and upgrade to 1.0c if you care about memory/PCI locality and performance (which is probably the case anyway).
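If you prefer to query this locality from an application instead of reading lstopo's output, hwloc can report which CPUs sit next to a given PCI device. The following sketch uses today's hwloc 2.x C API (I/O filtering and helpers), which is not what the development version mentioned above looked like, so take it as an illustration only; compile with -lhwloc.

  /* Sketch: ask hwloc which CPUs are close to a PCI device. Uses the
   * hwloc 2.x API, which postdates this post; illustration only. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;
      hwloc_obj_t pcidev, ancestor;
      char *cpuset_str;

      hwloc_topology_init(&topology);
      /* keep PCI objects in the topology */
      hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_ALL);
      hwloc_topology_load(topology);

      /* first GPU from the example above: 0000:02:00.0 */
      pcidev = hwloc_get_pcidev_by_busid(topology, 0, 0x02, 0x00, 0);
      if (pcidev) {
          /* PCI objects carry no cpuset; look at the closest normal ancestor */
          ancestor = hwloc_get_non_io_ancestor_obj(topology, pcidev);
          hwloc_bitmap_asprintf(&cpuset_str, ancestor->cpuset);
          printf("GPU 0000:02:00.0 is close to cpuset %s\n", cpuset_str);
          free(cpuset_str);
      }

      hwloc_topology_destroy(topology);
      return 0;
  }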


Dec. 27th, 2006

09:46 - Messing with the stack of PCI saved states

While testing a patch regarding the saving/restoring of MSI and PCI-Express states in the myri10ge driver, we discovered that recent changes in how the kernel saves those states cause problems with the way we use them. pci_save_state() and pci_restore_state() are used by drivers to save the PCI registers (the configuration space) in host memory before suspending a device, and to restore them afterwards. With the addition of MSI and PCI-Express registers to these routines in recent kernels, the saved registers are now kept on a stack. This is fine for normal usage: push onto the stack before suspend, pop during resume. But it is not fine when you save the state more often than you restore it: you push too much stuff onto the stack without ever freeing it, i.e. you leak memory.

But why the hell would you save the state too often? The myri10ge driver can recover from a memory parity error in the network interface: when a parity error occurs, the interface resets and the driver restores its previous state. Since we don't know when such an error will occur, the state has to be saved in advance. Then, if you suspend your machine, the PCI layer saves the state again, which means you duplicate the saved registers on the stack as explained above.
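To make the imbalance concrete, here is a simplified sketch of the pattern (made-up mydrv_* names, not the actual myri10ge code): the state is saved once to arm the recovery path, saved again by the suspend path, but only restored once on resume.

  /* Simplified illustration of the unbalanced save/restore pattern
   * described above; mydrv_* are made-up names, not myri10ge code. */
  #include <linux/pci.h>

  /* At probe time (and after each recovery), save the config space so it
   * can be restored whenever a parity error resets the NIC. */
  static void mydrv_arm_recovery(struct pci_dev *pdev)
  {
      pci_save_state(pdev);   /* push #1: saved "just in case" */
  }

  /* Later, a suspend saves the state again without anything having
   * popped the first copy off the stack. */
  static int mydrv_suspend(struct pci_dev *pdev, pm_message_t state)
  {
      pci_save_state(pdev);   /* push #2 */
      pci_set_power_state(pdev, pci_choose_state(pdev, state));
      return 0;
  }

  /* Resume restores only once, so one saved copy of the MSI/PCI-Express
   * registers is left on the stack after every suspend/resume cycle. */
  static int mydrv_resume(struct pci_dev *pdev)
  {
      pci_set_power_state(pdev, PCI_D0);
      pci_restore_state(pdev);        /* pop: balances push #2 only */
      return 0;
  }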

Some patches are queued to balance the calls to pci_save_state() and pci_restore_state() in the driver so that the stack always contains a single set of saved registers. But it might be better if the whole parity-recovery process was changed, since all this looks like a mess...


Aug. 10th, 2006

00:30 - MSI detection patches ready?

Over the last 2 months, I have been sending multiple patches to the Linux kernel mailing list to improve the way the kernel detects whether it should enable MSI (Message Signaled Interrupts) on a device or not. The main reason for this work (apart from the fact that MSI reduces the interrupt latency from about 10 us to 5 us) is that kernels up to 2.6.16 disable MSI on _all_ devices on machines that contain an AMD 8131 chipset (which does not support MSI). The problem is that only a couple of PCI devices are generally located behind this chipset, while all the other devices (including all the PCI-Express ones) are not related to it at all. Hence, there was no reason to disable MSI on all these other devices.
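The idea is to make the decision per device rather than globally, by looking only at the bridges above a given device. A rough sketch of such a check, assuming a per-bus "no MSI" flag set by a quirk on the offending chipset (an illustration with a made-up function name, not the exact code that was merged):

  /* Rough sketch: only refuse MSI for a device if one of the bridges
   * between it and the root cannot route MSI (e.g. the AMD 8131, flagged
   * by a quirk). Illustration only; mydrv_msi_allowed() is a made-up name. */
  #include <linux/pci.h>

  static int mydrv_msi_allowed(struct pci_dev *dev)
  {
      struct pci_bus *bus;

      /* walk every bus between the device and the root */
      for (bus = dev->bus; bus; bus = bus->parent)
          if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
              return 0;   /* an ancestor bridge blocks MSI */

      return 1;           /* no offending bridge above us */
  }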


For those who are interested, the patches are available in the -mm kernel through Greg K-H's PCI patchset and will probably end up in 2.6.19.
