Service interruption August 10 2016 (resolved)

Between 8:52 and 9:03 AM Pacific time today (August 10, 2016), some sites we host experienced an interruption of service. The problem is resolved and will not recur.

This happened because our network has been under a “distributed denial of service” attack against the telnet service, which we’ve been successfully blocking. However, when we deployed a change to make this blocking more effective, some servers ran low on memory, which led to extremely high loads for a few minutes.

This problem was caused by a one-time event that will not happen again. We sincerely apologize for the inconvenience this problem caused.

Technical details

Our servers, like many on the Internet, are being attacked by the LizardStresser botnet. This botnet is annoying because although our servers are not vulnerable to the security problem it tries to exploit, the sheer rate of telnet connections it attempts (dozens per second on some servers) exceeds the usual rate limits we have on telnet, making telnet unreliable for legitimate customers. (Nobody should really be using telnet at all — you should always use SSH — and we have plans to phase out telnet completely, but that’s a separate issue.)

We have a list of 163,736 IP addresses in this botnet. To block them from accessing telnet at all, we loaded all of these into the firewall on each server using ipsets. This works. However, on some of our servers that have 64 GB of memory, the act of loading them caused the server to completely empty out the Linux disk cache (more than 40 GB of it). This was surprising; we hadn’t seen this happen in tests. It caused the server to run much, much more slowly for a few minutes while it refilled the cache, re-loading that 40 GB of data from disks.
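
For readers unfamiliar with ipsets, here is a minimal sketch of how a large address list could be turned into input for the `ipset restore` command. It is an illustration only, not the script we actually ran: the set name `telnet_blocklist`, the function name, and the `maxelem` value are all assumptions made for this example. One detail worth knowing is that a `hash:ip` set holds 65,536 entries by default, so a list of this size needs a larger `maxelem`.

```python
import ipaddress


def build_ipset_restore(set_name, addresses, maxelem=262144):
    """Return text for `ipset restore` that creates and fills a hash:ip set.

    set_name and maxelem are hypothetical values chosen for this sketch.
    """
    # Validate every entry as an IPv4 address and drop duplicates and blanks.
    unique = sorted({ipaddress.IPv4Address(a.strip()) for a in addresses if a.strip()})
    if len(unique) > maxelem:
        raise ValueError(f"{len(unique)} addresses exceed maxelem={maxelem}")

    # The first line creates the set; each later line adds one address to it.
    lines = [f"create {set_name} hash:ip family inet maxelem {maxelem}"]
    lines.extend(f"add {set_name} {ip}" for ip in unique)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Documentation-range addresses stand in for the real botnet list.
    sample = ["192.0.2.10", "198.51.100.7", "192.0.2.10"]
    print(build_ipset_restore("telnet_blocklist", sample), end="")
```

The output could be saved to a file and loaded with `ipset restore < file`, after which a rule along the lines of `iptables -I INPUT -p tcp --dport 23 -m set --match-set telnet_blocklist src -j DROP` would drop telnet connections from any address in the set. Loading everything in one `ipset restore` run is much faster than running `ipset add` once per address.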