Brice Goglin's Blog

Jul. 19th, 2010

14:31 - Remote Console Access with IPMI on Dell R710

Our local servers are moving from Dell PowerEdge 2950 to R710. A couple of years ago, I wrote a guide for Remote Console Access through IPMI 2.0 on the 2950. A few noticeable changes are needed for the R710, so here's an updated guide. I also added some notes about the R815 and R720 at the end, since they are very similar.

You should first choose a new sub-network for IPMI. Although the IPMI network traffic goes through a regular physical interface, it has a different MAC address and should use a different IP address. If your boxes have 10.0.0.x regular IP addresses, you may for instance use 10.0.99.x for IPMI. Adding corresponding hostnames (for instance xxx-ipmi for host xxx) to your DNS or /etc/hosts file might be good too.
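
For instance (the hostnames and addresses here are of course just an illustration), the /etc/hosts entries could look like this:

    10.0.0.12     node12
    10.0.99.12    node12-ipmi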

At the end of the BIOS boot, press Ctrl-e to enter the Remote Access Setup and enable actual IPMI Remote Access (note that some models can also be configured from Linux using ipmitool after loading some ipmi kernel modules).
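
For the models that support configuration from Linux, a rough sketch looks like this (the module names are the usual ones and the LAN channel is often 1, but both may vary, as do the example addresses):

    # load the IPMI drivers and configure the BMC LAN channel (run as root)
    modprobe ipmi_si ipmi_devintf
    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 10.0.99.12
    ipmitool lan set 1 netmask 255.255.255.0
    ipmitool lan set 1 defgw ipaddr 10.0.99.1
    # check the resulting configuration
    ipmitool lan print 1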

IPMI is now configured correctly. You should be able to ping the IPMI IP addresses from the master node (assuming you properly enabled the 10.0.99.x network there).

    $ ping 10.0.99.x

Now, you may for instance reboot a node using the following line. Replace cycle with status to see the power status, off to shut the node down, or on to start it.

    $ ipmitool -I lan -H 10.0.99.x -U login -P passwd chassis power cycle

Now we need to configure console redirection. It makes it possible to send the BIOS, GRUB, and kernel output through IPMI over the network. Note that the second serial port should be used, so you will usually use COM2/ttyS1. After booting, press F2 to enter the BIOS and go into the Serial Communication menu:
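
The exact wording of the menu entries varies between BIOS revisions, so take the following as an illustration of the idea rather than an exact transcript:

    Serial Communication .......... On with Console Redirection via COM2
    Failsafe Baud Rate ............ 115200
    Remote Terminal Type .......... VT100/VT220
    Redirection After Boot ........ Enabled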

With this configuration, you should see the BIOS and GRUB output remotely using:

    $ ipmitool -I lanplus -H 10.0.99.x -U login -P password sol activate

Then we want to see the kernel booting remotely. This is done by adding the following to the kernel command line:

    console=ttyS1,115200n8 console=tty0

With GRUB2 on Debian, you should open /etc/default/grub and add these options to GRUB_CMDLINE_LINUX. By the way, you probably want to uncomment GRUB_TERMINAL=console and remove the quiet option nearby. Everything will be propagated to /boot/grub/grub.cfg when running update-grub.
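
As a sketch, after these changes the relevant lines of /etc/default/grub look something like this (assuming quiet was the only default option; keep any other options you already had):

    # quiet removed from the default options
    GRUB_CMDLINE_LINUX_DEFAULT=""
    GRUB_CMDLINE_LINUX="console=ttyS1,115200n8 console=tty0"
    GRUB_TERMINAL=console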

And finally, you might want to get a console login remotely through IPMI. To do so, add the following line to /etc/inittab:

    T0:23:respawn:/sbin/getty -L ttyS1 115200n8 vt100
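
On a sysvinit system, you can then make init re-read inittab and spawn the getty without rebooting:

    telinit q    # tell init to re-examine /etc/inittab (as root)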

With all this setup, the above ipmitool sol activate line will display the same thing as the physical console on the machine, which makes it very convenient for configuring the BIOS, changing the kernel, debugging, ... Note that ~ is the escape character when using console redirection, and ~. may be used to leave the console. Also, ipmitool sol deactivate may help if somebody did not leave the console correctly.
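
For reference, the matching command to forcibly close such a stuck session is:

    $ ipmitool -I lanplus -H 10.0.99.x -U login -P password sol deactivate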

Update for R815 (2012/05/30): The configuration for the R815 is very similar. I ran into some stricter constraints about the serial device configuration in the BIOS, but everything needed is already explained above.

Update for R720 (2012/05/31): On recent PowerEdge models, the IPMI configuration is directly available in the BIOS setup menus, so there is no need to hit Ctrl-e during boot anymore. Just enter the BIOS with F2 as usual, then go into the iDRAC configuration. The menus there are similar to those described above.
The other difference is that the R720 doesn't seem to work well with the IPMI lan interface. Always passing lanplus instead of lan to ipmitool -I seems to work fine.
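
For instance, the power management command from the beginning of this guide becomes:

    $ ipmitool -I lanplus -H 10.0.99.x -U login -P passwd chassis power status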


Nov. 5th, 2009

18:57 - Fun with SuperMicro BIOS and PCI-NUMA

We have a SuperMicro machine with an X8DAH motherboard at work. It contains two Intel Xeon Nehalem X5550 processors (8 cores, 16 threads total) and 3 GPUs. Like several Nehalem motherboards, it actually has 2 IO hubs, one near each socket.

  ---------   ------------   ------------   ---------
  | Mem#0 |===| Socket#0 |===| Socket#1 |===| Mem#1 |
  ---------   ------------   ------------   ---------
                   ||             ||
               -----------   -----------
               | IOHub#0 |===| IOHub#1 |
               -----------   -----------
                   ||             ||
                 GPU#0         GPU#1+2

So PCI devices behind one IO hub are closer to one socket than to the other, and DMA performance depends on where the target memory is located: in the memory near one socket, or in the other memory node. The motherboard manual tells us which PCI slots are actually behind which IO hub (and thus near which socket/memory), and benchmarking our GPUs confirms the actual position of each PCI device in the above picture. But we want to find out such information automatically to ease deployment and portability of applications. Linux may report such information through sysfs:

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14
  0,2,4,6,8,10,12,14

However, this is wrong since 0,2,4,6,8,10,12,14 means near socket #0, while 2 of the GPUs are actually near socket #1 (CPUs 1,3,5,7,9,11,13,15). This could have been a bug in the Linux kernel, but it's actually a bug in the BIOS (Linux just reports what the BIOS tells it). So we talked to SuperMicro about it and tried upgrading the BIOS.


The first BIOS upgrade (from 1.0 to 1.0b) went kind of bad: the machine didn't boot at all anymore, not even a BIOS message on screen. Fortunately, after we removed the GPUs, it booted again. But Linux didn't have any NUMA information at all: it just reported a single NUMA node instead of 2. So we forgot about all this mess and downgraded back to the older BIOS.

Another BIOS update came out recently (1.0c), so I contacted SuperMicro to ask whether it was worth upgrading. At some point, they asked me to try disabling NUMA in the current BIOS. The machine didn't boot anymore... except after removing some GPUs, exactly as above. It seems that there is an incompatibility between disabling NUMA in the BIOS and having multiple GPUs in the machine. And the first BIOS upgrade apparently disabled NUMA by default, causing all the above problems with BIOS 1.0b.


So we had to try upgrading again, and make sure NUMA wasn't left disabled by default again. Instead of going back to 1.0b, I upgraded the BIOS to the latest release (1.0c) directly. And now the machine finally reports the right PCI-NUMA information!

  $ cat /sys/bus/pci/devices/0000:{02:00.0,84:00.0,85:00.0}/local_cpulist
  0-3,8-11
  4-7,12-15
  4-7,12-15

You might have noticed that CPU numbering changed in the meantime (CPU number interleaving is different), but I don't care since we have hwloc (Hardware Locality) to deal with it. Now the development version of our lstopo tool reports the whole machine topology, including PCI, as expected:


In short, if you have a X8DAH motherboard, don't disable NUMA in the BIOS (why would you do that anyway?) since it causes boot failures in some cases (when 3 GPUs are connected here), and upgrade to 1.0c if you care about memory/PCI locality/performance (which is probably the case anyway).


Jul. 29th, 2008

19:06 - MMU notifiers bring into Linux what we've been wanting for HPC for a while

After the addition of ioremap_wc() in 2.6.26, MMU notifiers have now been merged in 2.6.27-rc1. It means that everything we have been wanting in the past to help HPC support is finally available upstream. We thought IB being merged (back in 2.6.11) would make things go fast, but it looks like these important features were not that obvious to people who have not worked on HPC for a long time.

Back in 2004, I was trying to get a safe registration cache working in the kernel for distributed storage over Myrinet. User-space regcaches are known to be a mess because they need to intercept malloc/free/munmap to invalidate cached segments; it works sometimes, but it is fragile. In the kernel, you just can't intercept anything. So I wrote a patch called VMASpy which allowed other subsystems to be notified when part of a "registered" VMA is unmapped or forked. I never submitted it since it couldn't be accepted unless somebody in the kernel (i.e. IB) used it. Given posts like this, we see that IB people weren't aware of the problem (nowadays they are interested, but something in the IB specs apparently prevents them from using this).

KVM needed some kernel support for its shadow pages, so MMU notifiers were written by Andrea Arcangeli (thanks a lot to him for keeping working on this despite many people not liking it). After a couple of months of trolling, here we are with 2.6.27-rc1: we can now register a notifier per mm_struct and get a callback when part of the address space is unmapped. The implementation is very different from my VMASpy and of course much better :) But the final API provides similar features, so it should be great news for people working on registration caches and the like.


18:48 - myri10ge broken in 2.6.26, will be fixed in 2.6.26.1

The myri10ge driver (Ethernet driver for Myri-10G boards) is broken in 2.6.26. It may not do anything at startup, and it may also oops when opening the interface. The breakage appeared because the big pile of updates sent for 2.6.26 was only partially applied (multislice RX only went into 2.6.27), and I did not test it thoroughly enough. Apologies.

2.6.27-rc1 is not affected by the breakage. And 2.6.25 works fine as well. Two patches have been sent to the stable release team for inclusion in 2.6.26.1. In the meantime, you may use Myricom's tarball, take the driver from 2.6.27-rc1 or from 2.6.25, ... or just not use 2.6.26 :)


Oct. 14th, 2007

11:57 - Remote Console Access with IPMI on Dell 2950

Update: New guide for Dell R710 servers.

I have been installing several Dell 2950 boxes recently and managed to configure Remote Console Access through IPMI 2.0. Since there is no nice/complete how-to to be found on Google, here's one.

You should first choose a new sub-network for IPMI. Although the IPMI network traffic uses the same physical network as the first interface of the boxes (make sure this one is connected), it has different MAC addresses and should use different IP addresses. If your boxes have 10.0.0.x regular IP addresses, you may for instance use 10.0.99.x for IPMI. Adding corresponding hostnames (for instance xxx-ipmi for host xxx) to your DNS or /etc/hosts file might be good too.

At the end of the BIOS boot, press Ctrl-e to enter the Remote Access Setup and enable actual IPMI Remote Access (note that all this may also be configured from Linux using ipmitool after loading some ipmi kernel modules).

IPMI is now configured correctly. You should be able to ping the IPMI IP addresses.

    $ ping 10.0.99.x

Now, you may for instance reboot a node using the following line. Replace cycle with status to see the power status, off to shut the node down, or on to start it.

    $ ipmitool -I lan -H 10.0.99.x -U login -P passwd chassis power cycle

Now we need to configure console redirection. It makes it possible to send the BIOS, Grub, and ttyS1 output through IPMI over the network, on the first network interface. Note that COM2/ttyS1 is mandatory; it cannot be COM1/ttyS0 instead. After booting, press F2 to enter the BIOS and go into the Serial Communication menu:

With this configuration, you should see the BIOS and Grub output remotely using:

    $ ipmitool -I lanplus -H 10.0.99.x -U login -P password sol activate

Then we want to see the kernel booting remotely. This is done by adding the following to the kernel command line. With Grub, you might want to add it to the # kopt=... line and then run update-grub to update all automatic entries.

    console=ttyS1,57600 console=tty0
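
As a sketch, in a Debian-style /boot/grub/menu.lst the kopt line ends up looking something like this (the root=... part is whatever was already there; the value shown is only for illustration):

    # kopt=root=/dev/sda1 ro console=ttyS1,57600 console=tty0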

And finally, you might want to get a console login remotely through IPMI. To do so, add the following line to /etc/inittab:

    T0:23:respawn:/sbin/getty -L ttyS1 57600 vt100

With all this setup, the above ipmitool sol activate line will display the same thing as the physical console on the machine, which makes it very convenient for configuring the BIOS, changing the kernel, debugging, ... Note that ~ is the escape character when using console redirection, and ~. may be used to leave the console. Also, ipmitool sol deactivate may help if somebody did not leave the console correctly.


Dec. 27th, 2006

09:46 - Messing with the stack of PCI saved states

While testing a patch regarding the saving/restoring of the MSI and PCI-Express state in the myri10ge driver, we discovered that recent changes in how the kernel saves this state cause problems with how we use it. pci_save/restore_state() are used by drivers to save the PCI register (configuration space) state in host memory before suspending a device. With the addition of MSI and PCI-Express registers to these routines in recent kernels, the way the registers are saved has been converted into a stack. This looks fine for normal usage: push on the stack before suspend, pop during resume. But it is actually not fine when you save the state more often than you restore it: you push too much stuff on the stack without ever freeing it, i.e. you leak some memory.

But why the hell would you save the state too often? The myri10ge driver can recover from a memory parity error in the network interface. When a parity error occurs, the interface resets and the driver restores its previous state. But we don't know when such an error will occur, so the state has to be saved in advance. Then, if you suspend your machine, the PCI layer saves the state again, which means you duplicate the saved registers on the stack as explained above.

Some patches are queued to balance the calls to pci_save/restore_state() in the driver so that the stack always contains a single set of saved registers. But it might be better if the whole parity recovery process was changed, since all this looks like a mess...


Sep. 20th, 2006

00:30 - Linux 2.6.18 is out with the Myri-10G Ethernet driver

Over the last 4 months, I have been sending patches to include the myri10ge driver in the Linux kernel. Linux 2.6.18 has just been released; it is the first kernel to include myri10ge.


Aug. 16th, 2006

21:19 - What HPC Networking Requires from the Linux Kernel

Since I am working on HPC-networking drivers, my company made me write an article for HPCwire's LinuxWorld Expo coverage about what problems we have to deal with in the Linux kernel (it is the best OS for HPC, but there are still problems) and what support we would like to get: What HPC Networking Requires from the Linux Kernel.

As usual, some people will reply that getting the driver merged in Linux and only supporting recent kernels would make things much easier. But, we do not decide what kernel our customers want to use, so...


Aug. 10th, 2006

00:30 - MSI detection patches ready?

Over the last 2 months, I have been sending multiple patches to the Linux kernel mailing list to improve the way the kernel detects whether it should enable MSI (Message Signaled Interrupts) on a device or not. The main reason for this work (apart from the fact that MSI reduces the interrupt latency from about 10 to 5 µs) is that kernels up to 2.6.16 disable MSI on _all_ devices on machines that contain an AMD 8131 chipset (which does not support MSI). The problem is that only a couple of PCI devices are generally located behind this chipset, while all the other devices (including all the PCI-Express ones) are not related to it at all. Hence, there was no reason to disable MSI on all these latter devices.


For those who are interested, the patches are available in the -mm kernel through Greg K-H's PCI patchset and will probably end up in 2.6.19.


Jul. 23rd, 2006

17:20 - Back from OLS 2006

I just came back from the Linux Symposium in Ottawa. It was great. Here are some talks that I enjoyed:

I haven't found the slides online so far, but at least the associated articles are available in the Proceedings.

A couple of bad things about this symposium anyway: