Brice Goglin's Blog

Nov. 5th, 2009

18:57 - Fun with SuperMicro BIOS and PCI-NUMA

We have a SuperMicro machine with an X8DAH motherboard at work. It contains 2 Intel Xeon Nehalem X5550 processors (8 cores, 16 threads total) and 3 GPUs. As on several Nehalem motherboards, there are actually 2 IO hubs, one near each socket.

  ---------   ------------   ------------   ---------
  | Mem#0 |===| Socket#0 |===| Socket#1 |===| Mem#1 |
  ---------   ------------   ------------   ---------
                   ||             ||
               -----------   -----------
               | IOHub#0 |===| IOHub#1 |
               -----------   -----------
                   ||             ||
                 GPU#0         GPU#1+2

PCI devices behind one IO hub are therefore closer to one socket than to the other, so DMA performance depends on where the target memory is located: in the memory near one socket, or in the other memory node. The motherboard manual tells us which PCI slots are behind which IO hub (and thus near which socket/memory), and benchmarking our GPUs confirms the position of each PCI device in the above picture. But we want to find out such information automatically to ease deployment and portability of applications. Linux may report such information through sysfs:

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14

However, this is wrong: 0,2,4,6,8,10,12,14 means near socket #0, while 2 of the GPUs are actually near socket #1 (CPUs 1,3,5,7,9,11,13,15). This could have been a bug in the Linux kernel, but it's actually a bug in the BIOS (Linux just reports what the BIOS tells it). So we talked to SuperMicro about it and tried upgrading the BIOS.
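For scripted checks, the same sysfs file can be read and expanded into individual CPU numbers. The following is a minimal sketch in Python, not part of the original setup: the helper names `parse_cpulist` and `local_cpus` are hypothetical, and the three PCI addresses are simply the ones used in the command above. It assumes the plain comma/range format that the kernel prints in `local_cpulist`.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file or a stray comma yields no CPUs
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return sorted(cpus)


def local_cpus(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the CPUs that sysfs reports as local to one PCI device (hypothetical helper)."""
    path = Path(sysfs_root) / pci_address / "local_cpulist"
    return parse_cpulist(path.read_text())


if __name__ == "__main__":
    # Addresses taken from the command above; they will differ on other machines.
    for address in ("0000:02:00.0", "0000:84:00.0", "0000:85:00.0"):
        try:
            print(address, local_cpus(address))
        except OSError as err:
            print(address, "unavailable:", err)
```

Comparing the returned lists against the CPUs of each socket makes it easy to spot a BIOS that reports every device as local to the same socket.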


The first BIOS upgrade (from 1.0 to 1.0b) went kind of bad: the machine didn't boot anymore at all, without even a BIOS message on screen. Fortunately, it booted again once we removed the GPUs. But Linux didn't have any NUMA information at all: it just reported a single NUMA node instead of 2. So we forgot about all this mess and downgraded back to the older BIOS.

Another BIOS update came out recently (1.0c), so I contacted SuperMicro to find out whether it was worth upgrading. At some point, they asked me to try disabling NUMA in the current BIOS. The machine didn't boot anymore... except after removing some GPUs. Exactly as above. It seems that there is an incompatibility between disabling NUMA in the BIOS and having multiple GPUs in the machine. And the first BIOS upgrade apparently disabled NUMA by default, causing all the above problems with BIOS 1.0b.


So we had to try upgrading again, and make sure NUMA wasn't left disabled by default again. Instead of going back to 1.0b, I upgraded the BIOS to the latest release (1.0c) directly. And now the machine finally reports the right PCI-NUMA information!

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0-3,8-11
  4-7,12-15
  4-7,12-15

You might have noticed that CPU numbering changed in the meantime (CPU number interleaving is different), but I don't care since we have hwloc (Hardware Locality) to deal with it. Now the development version of our lstopo tool reports the whole machine topology, including PCI, as expected:


In short, if you have an X8DAH motherboard, don't disable NUMA in the BIOS (why would you do that anyway?), since it causes boot failures in some cases (here, when 3 GPUs are connected). And upgrade to 1.0c if you care about memory/PCI locality/performance (which is probably the case anyway).


Jul. 29th, 2008

19:06 - MMU notifiers bring into Linux what we've been wanting for HPC for a while

After the addition of ioremap_wc() in 2.6.26, MMU notifiers have now been merged in 2.6.27-rc1. It means that everything we have been wanting in the past to help HPC support is finally available upstream. We thought IB being merged (back in 2.6.11) would speed things up, but it looks like these important features were not that obvious to people who have not worked on HPC for a long time.

Back in 2004, I was trying to get a safe registration cache working in the kernel for distributed storage over Myrinet. User-space regcaches are known to be a mess because they need to intercept malloc/free/munmap to invalidate cached segments. It works sometimes, but it is fragile. In the kernel, you just can't intercept anything. So I wrote a patch called VMASpy which allowed other subsystems to be notified when part of a "registered" VMA is unmapped or forked. I never submitted it since it couldn't be accepted unless somebody in the kernel (i.e. IB) used it. Given posts like this, we see that IB people weren't aware of the problem (nowadays they are interested but something in the IB specs apparently prevents them from using this).

KVM needed some kernel support for its shadow pages, so MMU notifiers were written by Andrea Arcangeli (thanks a lot to him for continuing to work on this despite many people not liking it). After a couple of months of trolls, here we go: with 2.6.27-rc1, we can now register a notifier per mm_struct and get a callback when part of the address space is unmapped. The implementation is very different from my VMASpy and of course much better :) But the final API provides similar features, so it should be great news for people working on registration caches or so.
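To make the registration cache use case concrete, here is a toy user-space model in Python. It is only a sketch of the idea, not the kernel API: the class `RegistrationCache`, its methods, and the integer "handles" are all hypothetical, and `on_unmap` merely stands in for whatever callback the notifier delivers when a range of the address space goes away.

```python
import itertools


class RegistrationCache:
    """Toy model: cache registered (pinned) segments until their range is unmapped."""

    def __init__(self):
        self._segments = {}                 # (start, end) -> registration handle
        self._handles = itertools.count(1)  # fake handles standing in for real registrations

    def lookup_or_register(self, start, length):
        """Reuse a cached registration for [start, start+length), or create a new one."""
        key = (start, start + length)
        if key not in self._segments:
            self._segments[key] = next(self._handles)  # the expensive step in a real stack
        return self._segments[key]

    def on_unmap(self, start, end):
        """Invalidate every cached segment overlapping the unmapped range [start, end)."""
        stale = [key for key in self._segments if key[0] < end and start < key[1]]
        for key in stale:
            del self._segments[key]
        return len(stale)


if __name__ == "__main__":
    cache = RegistrationCache()
    first = cache.lookup_or_register(0x1000, 0x2000)
    again = cache.lookup_or_register(0x1000, 0x2000)
    print("cache hit:", first == again)
    print("invalidated:", cache.on_unmap(0x2000, 0x3000))
    print("re-registered:", cache.lookup_or_register(0x1000, 0x2000) != first)
```

The point of the model is that the cache never has to guess: a stale entry is dropped exactly when the unmap notification arrives, instead of relying on intercepted malloc/free/munmap calls.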


18:48 - myri10ge broken in 2.6.26, will be fixed in 2.6.26.1

The myri10ge driver (Ethernet driver for Myri-10G boards) is broken in 2.6.26. It may not do anything at startup. It may also oops when opening the interface. The breakage appeared because the big pile of updates sent for 2.6.26 was only partially applied (multislice RX only went into 2.6.27), and I did not test it intensively enough. Apologies.

2.6.27-rc1 is not affected by the breakage. And 2.6.25 works fine as well. Two patches have been sent to the stable release team for inclusion in 2.6.26.1. In the meantime, you may use Myricom's tarball, take the driver from 2.6.27-rc1 or from 2.6.25, ... or just not use 2.6.26 :)


Dec. 27th, 2006

09:46 - Messing with the stack of PCI saved states

While testing a patch regarding the saving/restoring of MSI and PCI-Express states in the myri10ge driver, we discovered that recent changes in how the kernel saves those states result in problems with how we use it. pci_save_state() and pci_restore_state() are used by drivers to save the PCI registers (the configuration space) in host memory before suspending a device, and to restore them afterwards. With the addition of MSI and PCI-Express registers to these routines in recent kernels, the way the registers are saved has been converted to a stack. This looks fine for normal usage: push on the stack before suspend, pop during resume. But it is actually not fine when you save the state more often than you restore it: you push too much stuff on the stack without ever freeing it, i.e. you leak some memory.

But why the hell would you save the state too often? The myri10ge driver can recover from a memory parity error in the network interface. When a parity error occurs, the interface resets and the driver restores its previous state. But we don't know when such an error will occur. Therefore, the state has to be saved in advance. Then, if you suspend your machine, the PCI layer saves the state again, which means you duplicate the saved registers on the stack as explained above.

Some patches are queued to balance the calls to pci_save/restore_state() in the driver so that the stack always contains a single set of saved registers. But it might be better if the whole parity recovery process was changed, since all this looks like a mess...
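The imbalance described above can be illustrated with a toy model in Python. This is only a sketch: `SavedStateStack`, `leftover_entries`, and the event sequences are hypothetical and do not mirror the actual kernel data structures. It shows the one property that matters here: every save pushes a copy, every restore pops one, and whatever is still on the stack at the end is never freed.

```python
class SavedStateStack:
    """Toy model of a stack of saved register sets."""

    def __init__(self):
        self._entries = []

    def save(self, registers):
        """Push a snapshot of the registers (models a save call)."""
        self._entries.append(dict(registers))

    def restore(self):
        """Pop the most recent snapshot, or return None if nothing was saved."""
        return self._entries.pop() if self._entries else None

    def depth(self):
        return len(self._entries)


def leftover_entries(events):
    """Replay a sequence of "save"/"restore" events and return how many snapshots remain."""
    stack = SavedStateStack()
    registers = {"command": 0x0007}  # made-up register content
    for event in events:
        if event == "save":
            stack.save(registers)
        elif event == "restore":
            stack.restore()
        else:
            raise ValueError(f"unknown event: {event}")
    return stack.depth()


if __name__ == "__main__":
    # Plain suspend/resume: one push, one pop, nothing left behind.
    print(leftover_entries(["save", "restore"]))
    # Driver saves in advance for parity recovery, then the machine suspends and resumes.
    print(leftover_entries(["save", "save", "restore"]))
```

In the second sequence one snapshot stays on the stack after resume, which is the duplicate the post describes; any additional unmatched save adds another.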


Sep. 20th, 2006

00:30 - Linux 2.6.18 is out with the Myri-10G Ethernet driver

Over the last 4 months, I have been sending patches to include the myri10ge driver in the Linux kernel. Linux 2.6.18 has just been released, and it is the first kernel to include myri10ge.


Aug. 16th, 2006

21:19 - What HPC Networking Requires from the Linux Kernel

Since I am working on HPC-networking drivers, my company made me write an article for HPCwire's LinuxWorld Expo coverage about what problems we have to deal with in the Linux kernel (it is the best OS for HPC, but there are still problems) and what support we would like to get: What HPC Networking Requires from the Linux Kernel.

As usual, some people will reply that getting the driver merged in Linux and only supporting recent kernels would make things much easier. But we do not decide which kernel our customers want to use, so...


Aug. 10th, 2006

00:30 - MSI detection patches ready?

Over the last 2 months, I have been sending multiple patches to the Linux Kernel mailing list to improve the way the kernel detects whether it should enable MSI (Message Signaled Interrupts) on a device or not. The main reason for this work (apart from the fact that MSI reduces the interrupt latency from about 10 to 5 us) is that kernels up to 2.6.16 disable MSI on _all_ devices on machines that contain an AMD 8131 chipset (which does not support MSI). The problem is that only a couple of PCI devices are generally located behind this chipset, while all other devices (including all the PCI-Express ones) are not related to it at all. Hence, there was no reason to disable MSI on all these latter devices.
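The difference between a machine-wide switch and a per-device decision can be sketched with a small Python model. This is an illustration of the idea only, not the patches themselves: the `Bus` class, the `no_msi` flag, and `msi_allowed` are hypothetical names, and the topology is made up.

```python
class Bus:
    """Toy PCI bus: knows its parent bus and whether the bridge leading to it breaks MSI."""

    def __init__(self, name, parent=None, no_msi=False):
        self.name = name
        self.parent = parent
        self.no_msi = no_msi


def msi_allowed(bus):
    """Allow MSI unless the device's bus, or any bus above it, is flagged as not supporting MSI."""
    while bus is not None:
        if bus.no_msi:
            return False
        bus = bus.parent
    return True


if __name__ == "__main__":
    root = Bus("root")
    behind_8131 = Bus("behind-amd-8131", parent=root, no_msi=True)
    nested = Bus("nested", parent=behind_8131)
    pcie = Bus("pcie", parent=root)
    for bus in (behind_8131, nested, pcie):
        print(bus.name, "MSI allowed:", msi_allowed(bus))
```

With this kind of check, only the devices that actually sit below the flagged bridge lose MSI, while unrelated devices elsewhere in the machine keep it.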

For those who are interested, the patches are available in the -mm kernel through Greg K-H's PCI patchset and will probably end up in 2.6.19.


Jul. 23rd, 2006

17:20 - Back from OLS 2006

I just came back from the Linux Symposium in Ottawa. It was great. Here are some talks that I enjoyed:

I didn't find the slides on-line so far, but at least the associated articles are available in the Proceedings.

A couple bad things about this symposium anyway: