Stability improvements for a server memory problem

A couple of days ago, one of our Web servers became unstable for an unknown reason and needed to be restarted. This is rare: on average, this happens less than once every five years of uptime per server, so we took it very seriously and launched an investigation.

What we found was that the owner of one of the sites on that server made a mistake that allowed attackers to run their own scripts. That’s all too common, unfortunately, but usually only the single site is affected by this kind of thing. What was surprising in this case was that the script used a previously unknown method of causing problems for other sites running on the server.

As a result of this investigation, we’ve made several changes to our systems to ensure the problem won’t recur. The rest of this post has a detailed technical description of the problem in case it’s useful for others.

Technical details

Over a year ago, one of our customers installed a script called “vBSEO” on his site. Unfortunately, that script had a known programming error that allowed strangers to run any software as if they were the site owner, and the customer didn’t update the script promptly. An attacker took advantage of that bug to run some malicious software.

Note that the attack didn’t rely on any kind of security flaw in our systems, and the attacker didn’t gain access to any other sites, passwords, or anything else. All that happened is that the attacker was able to run some software that the real site owner could have run (but wouldn’t have). Finding flaws in vBSEO is, in one sense, a waste of the attacker’s time, because he could have simply opened an account with us and run the same software. Any of our customers could do it.

Because of that, one of the important features of our service is that we isolate each site from other sites. It should be impossible for one site owner to do something that affects others. In this case, though, the malicious software used up general server memory in a way that bypassed our normal per-site memory restrictions. That had the effect of making a small number of PHP scripts on the server fail to start, generating “Internal Server Error” messages instead.

What the malicious software actually did was create over 4000 small “shared memory segments”, filling all 4,096 shared memory slots available on the server. It appears that unlike some other operating systems, Linux has no per-process limit that prevents one process from doing this. (Solaris, for example, allows you to set such a limit by altering SHMSEG, and it defaults to allowing just 6 segments per process.) In fact, this single, trivial line of PHP code allows any user to fill up all the slots on a Linux server:

while (shmop_open(0, "c", 0644, 100)) {} // key 0 is IPC_PRIVATE, so every call creates a new segment

Once the shared memory slots are all taken, any other process on the server that tries to use shared memory will fail until a new slot opens up. This means that PHP with eAccelerator wasn’t able to successfully run some scripts (which set off immediate alarms in our monitoring systems).
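For anyone who wants to check their own servers, here is a minimal Python sketch of how the slot usage could be measured. It assumes a Linux system that exposes the segment table at /proc/sysvipc/shm (one header line, then one line per segment) and the system-wide slot limit at /proc/sys/kernel/shmmni. The function names are ours, chosen for illustration, and this is not the code our monitoring actually runs.

```python
from pathlib import Path

# Standard Linux procfs locations for System V shared memory state.
SHM_TABLE = Path("/proc/sysvipc/shm")
SHM_LIMIT = Path("/proc/sys/kernel/shmmni")


def count_segments(table_text: str) -> int:
    """Count segments in /proc/sysvipc/shm-style text.

    The first non-blank line is a column header; each later line is one segment.
    """
    lines = [line for line in table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def slots_remaining(table_text: str, limit: int) -> int:
    """Return how many shared memory slots are still free (never negative)."""
    return max(limit - count_segments(table_text), 0)


def read_slots_remaining() -> int:
    """Read the live values from procfs (Linux only)."""
    limit = int(SHM_LIMIT.read_text().strip())
    return slots_remaining(SHM_TABLE.read_text(), limit)
```

On an affected server, read_slots_remaining() would return 0, which is the point at which any other process asking for a new segment starts to fail.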

It’s interesting that such a trivial attack from a non-privileged user breaks the widely used eAccelerator (and other) software. We can’t find any other discussion of this on the Internet, and it’s likely that many other companies are vulnerable, too.

Changes to solve the problem

After discovering the exact cause of this problem, we took three steps to make sure it can’t happen again:

  1. We’ve added a mod_security rule to block this attack on vBSEO, even if our customers haven’t updated their vBSEO software. (This rule has prevented attacks on six other sites we host since we added it — it’s a widespread attack.)
  2. Our monitoring systems now check for the “out of shared memory slots” situation and automatically fix it by deleting rogue segments.
  3. Most importantly, we’ve changed how eAccelerator works on our systems so that it doesn’t require a shared memory segment. Even if someone does fill up the shared memory slots, eAccelerator will no longer fail, so it won’t cause noticeable problems for customers.
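To give a rough idea of what the check in step 2 might look like, here is a hedged Python sketch. The heuristic is an assumption made for illustration, not a description of our production monitoring: it treats a segment as rogue when its owner holds more than max_per_uid segments and no process is attached to it. It looks up columns by the header names in /proc/sysvipc/shm (uid, nattch, shmid) and only builds ipcrm commands; it doesn't run them.

```python
from collections import defaultdict
from typing import List


def find_rogue_segments(table_text: str, max_per_uid: int) -> List[int]:
    """Return shmids of unattached segments owned by users over the threshold.

    table_text is /proc/sysvipc/shm-style text: a header line naming the
    columns, followed by one whitespace-separated line per segment.
    """
    lines = [line.split() for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    col = {name: index for index, name in enumerate(header)}

    # Group segments by owning user ID.
    by_uid = defaultdict(list)
    for row in rows:
        by_uid[row[col["uid"]]].append(row)

    rogue = []
    for uid_rows in by_uid.values():
        if len(uid_rows) <= max_per_uid:
            continue  # this owner is within the (hypothetical) threshold
        for row in uid_rows:
            if int(row[col["nattch"]]) == 0:  # no process attached
                rogue.append(int(row[col["shmid"]]))
    return sorted(rogue)


def removal_commands(shmids: List[int]) -> List[List[str]]:
    """Build one `ipcrm -m <shmid>` command per segment, without running it."""
    return [["ipcrm", "-m", str(shmid)] for shmid in shmids]
```

A real deployment would need a threshold that fits its own legitimate workloads, since some software does hold several segments at once.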

We’ll also notify the eAccelerator folks of this incident in case they want to change the default way it works, avoiding this problem for others.