Avoiding a Linux kernel 2.6.26 cgroup bug

We recently had a server that “crashed” twice and had to be restarted manually. We’ve identified the cause (an apparent bug in Linux kernel version 2.6.26) and made some changes to ensure that it doesn’t affect our customers again.

However, we didn’t find any information about this problem when searching the Internet, so we’re describing the details here in the hope that it helps someone else.

On our servers, we use Linux kernel version 2.6.24 or 2.6.26. There’s a reason we use these fairly modern versions: they include the Linux Completely Fair Scheduler to better allocate available CPU time to Web site scripts.

It’s worth talking a little about the advantages of the Completely Fair Scheduler for shared hosting servers. Traditionally, Linux has allocated CPU power on the basis of the number of tasks (programs) a user is running. If you have four users (call them Alice, Bob, Carol and Dave) each running a single PHP script, each of the scripts will get 25% of the available CPU power.

But if Alice’s site goes haywire and starts running 97 simultaneous PHP scripts, while the other three users continue to use a single PHP script, there’s a problem. Linux will dole out 1% of the available CPU power to each of the hundred scripts running. So Alice ends up using 97% of the server’s CPU resources, while Bob, Carol and Dave each only get 1%, making their scripts run 25 times more slowly. This is a common problem in shared hosting, and it leads to (rightly) unhappy customers.

The Completely Fair Scheduler (CFS), introduced with Linux kernel 2.6.23, fixes this. It allows CPU power to be allocated in different ways. In particular, it makes it easy to allocate CPU power by user, instead of by process.

In the above example, CFS can assign each user 25% of the CPU resources. Alice will find her scripts running very slowly, because all 97 of them have to share 25% of the CPU power (giving each script only 0.26% of the server’s total CPU). But Bob, Carol and Dave won’t notice a thing: their scripts will each get 25% of the CPU power, running just as quickly as they ever did.
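The arithmetic behind the two policies can be checked with a short shell sketch. The function names are hypothetical and exist only for this illustration; awk is used because plain shell arithmetic has no decimals.

```shell
#!/bin/sh
# Illustrative arithmetic only; the function names are hypothetical.

# Per-process scheduling: every runnable task gets an equal slice.
# $1 = number of tasks owned by the user, $2 = total tasks on the server
per_process_share() {
    awk -v mine="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * mine / total }'
}

# Per-user scheduling: every user gets an equal slice, which is then
# split evenly among that user's own tasks.
# $1 = number of users, $2 = number of tasks owned by the user
per_user_task_share() {
    awk -v users="$1" -v tasks="$2" 'BEGIN { printf "%.2f\n", 100 / users / tasks }'
}

per_process_share 97 100    # Alice's total share, per-process: 97.00
per_process_share 1 100     # Bob's share, per-process: 1.00
per_user_task_share 4 97    # each of Alice's scripts, per-user: 0.26
per_user_task_share 4 1     # Bob's single script, per-user: 25.00
```

The figures are percentages of the server's total CPU time and match the numbers in the example above.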

We started using CFS last year when the Debian GNU/Linux Etch-and-a-half release included kernel version 2.6.24. Since then, the “one user goes haywire and causes everything else on a server to be slow” problem has pretty much been eliminated.

Kernel 2.6.24 included a version of CFS that behaves just like we described above, doling out CPU power equally to each user. Although we can tweak the settings a little to allocate more CPU power to a certain user if necessary, it’s not particularly flexible (which is fine by us).

Which brings us to kernel 2.6.26. We’ve recently started buying more powerful servers (dual quad-core Xeon CPUs, 8 GB of RAM, three-disk RAID 1 arrays) that require kernel 2.6.26 or later because of the new network cards they use.

However, CFS in Debian’s kernel 2.6.26 is more complicated: it groups tasks using control groups (cgroups), and each process has to be assigned to a group explicitly. So we wrote a utility that notices when a new script is launched (by the Apache Web server, for example) and assigns that process to a group for the appropriate user.
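For readers unfamiliar with the mechanism, the assignment step can be sketched in a few lines of shell. This is an illustration, not the utility itself: the function name, the `CGROUP_ROOT` variable, the `/dev/cpuctl` default mount point, and the one-directory-per-user layout are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only. CGROUP_ROOT, its default value, and the layout of one
# group directory per user are assumptions for illustration.
CGROUP_ROOT="${CGROUP_ROOT:-/dev/cpuctl}"

# Attach a process to the group named after a user.
# $1 = process ID, $2 = user name
assign_to_user_group() {
    group_dir="$CGROUP_ROOT/$2"

    # In a mounted cgroup hierarchy, creating a directory creates a group.
    [ -d "$group_dir" ] || mkdir "$group_dir" || return 1

    # Writing a PID to the group's "tasks" file moves that process into it.
    echo "$1" > "$group_dir/tasks"
}

# Example (needs root and a mounted cgroup hierarchy):
#   assign_to_user_group 4348 alice
```

Each call results in exactly one write to a “tasks” file, which is the operation that matters for the rest of this post.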

To do this, it writes the process ID to a “tasks” file in the mounted cgroup file system. It normally does this tens of thousands of times per day, and we’d had no problems. However, one of our customers recently started running many more scripts, so the code now runs many hundreds of thousands of times per day. Since then we’ve twice seen a horrible problem that seems to happen about once every million writes: the Linux kernel locks up while processing the write, according to the stack backtrace:


WARNING: at lib/kref.c:43 kref_get+0x17/0x1c()
Pid: 4348, comm: cgroups-assigne Not tainted 2.6.26-bpo.1-686-bigmem #1
[] warn_on_slowpath+0x40/0x66
[] proc_pident_instantiate+0x74/0x84
[] __d_lookup+0x96/0xd5
[] __d_lookup+0x96/0xd5
[] do_lookup+0x53/0x153
[] notify_change+0x2b0/0x2c0
[] kref_get+0x17/0x1c
[] cgroup_attach_task+0x12a/0x3cf
[] cgroup_common_file_write+0x127/0x19b
[] cgroup_common_file_write+0x0/0x19b
[] cgroup_file_write+0x41/0x115
[] sys_fstat64+0x1e/0x23
[] security_file_permission+0xc/0xd
[] rw_verify_area+0x83/0xa2
[] cgroup_file_write+0x0/0x115
[] vfs_write+0x83/0x120
[] sys_write+0x3c/0x63
[] sysenter_past_esp+0x78/0xb1

(Apologies for the technical details, but that’s the point of this post, really — to help someone else having the same problem find the information above in a Web search.)

This is almost certainly a bug in the cgroup code in kernel 2.6.26. Much of the cgroup code seems to have been rewritten for newer versions of the kernel, so in theory we could use one of those — but for security reasons, we try very hard to only use kernels provided by the Debian folks.

Because of that, we’ve stopped using cgroup-based CFS on 2.6.26 servers, so we’re confident that the crash won’t happen again. However, using CFS helps our customers, so we need a better long-term solution.

It turns out it’s possible to recompile kernel 2.6.26 to use the simpler, non-cgroup-based CFS that was the default in 2.6.24. We’ve done that, and we’ve installed the new version on some test servers (not live hosting servers yet). Assuming it continues working correctly, we’ll install the custom kernel on hosting servers, too; look for an announcement about that later this week.
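For anyone attempting the same rebuild, the choice between the two kinds of group scheduling is made in the kernel configuration. The sketch below reports which one a given config file selects. The option names `CONFIG_USER_SCHED` and `CONFIG_CGROUP_SCHED` are what we understand the 2.6.26 source to use; treat them as an assumption and check `init/Kconfig` in your own tree. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: reports which basis for group scheduling a kernel
# configuration file selects. The option names are assumed to match
# the 2.6.26 source; verify them against init/Kconfig.
# $1 = path to a kernel config file
group_sched_basis() {
    if grep -q '^CONFIG_USER_SCHED=y' "$1"; then
        echo "user"
    elif grep -q '^CONFIG_CGROUP_SCHED=y' "$1"; then
        echo "cgroup"
    else
        echo "none"
    fi
}

# Example, on a system that keeps its config in /boot:
#   group_sched_basis "/boot/config-$(uname -r)"
```

A result of “user” would indicate the simpler per-user grouping; “cgroup” would indicate the grouping that triggered the lockup described above.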

4 Comments

  1. Hi! just wondering if this issue has been fixed in kernel code?

    Thanks!

  2. >just wondering if this issue has been fixed in kernel code?

    No idea, unfortunately. We only use Debian stable kernels, so we haven’t actually tried anything beyond 2.6.26.

    We did look at later kernels and saw that much of the CFS code has been completely rewritten, so the chances are that the bug is gone. But since we don’t know what lines of code caused the problem, we can’t say for sure.

  3. Thanks for info!

    I am also using Debian stable and tried to search for more info on how to configure cgroups. Right now I am running a home server, but I want to learn because I might need this later on business-class servers.

  4. I tried with kernel 2.6.31. The problem is that the load goes higher than 1.00 on a 1-CPU machine with the following test.

    I think this doesn’t limit CPU usage; it only prioritizes usage according to the shares we feed it. I don’t see anything new compared to the “nice” command.

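    #!/bin/bash
    # Give the two groups nearly equal relative CPU weights.
    echo 54 > /dev/cpuctl/highprio/cpu.shares
    echo 46 > /dev/cpuctl/lowprio/cpu.shares

    # Start burnK6 in a detached screen session and put it in lowprio.
    screen -fa -d -m -S K6 burnK6
    piddi=$(pidof burnK6 2>/dev/null)
    echo "K6 $piddi"
    echo "$piddi" > /dev/cpuctl/lowprio/tasks

    # Start burnK7 the same way and put it in highprio.
    screen -fa -d -m -S K7 burnK7
    piddi=$(pidof burnK7 2>/dev/null)
    echo "K7 $piddi"
    echo "$piddi" > /dev/cpuctl/highprio/tasks

    #killall burnK6
    #killall burnK7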

    If I run only one burnK7 with a cpu.shares value of 1, compared with normal processes (1024?), burnK7 still takes all the CPU power.