Brief service interruption on web11 server (resolved)

Between 1:53 AM and 2:09 AM Pacific time on May 1, disk access on the “web11” server became very slow, and the server had to be restarted. We restarted it, and normal service resumed at 2:10 AM. Other servers were not affected.

Network maintenance Saturday March 24 (completed)

We’ve been notified by an upstream network provider that they will be performing router firmware upgrades on Saturday, March 24, 2012, between 4:00 and 4:30 PM Pacific time.

Most customers will not notice any service interruption because we use redundant network providers, but in the worst case it can take up to about 90 seconds for certain parts of the Internet to see the changed “routes”. That means a brief interruption is theoretically possible for some connections. We’re announcing this just so you know that if you do see any problem, it will be resolved quickly.

Update 4:33 PM Pacific time: The maintenance has been completed.

Brief scheduled maintenance for MySQL update March 9, 2012 (completed)

Between 10:00 PM and 11:00 PM Pacific time on Friday March 9, 2012, we’ll be updating the MySQL database software on all our hosting servers. This will cause a Web site service interruption of about 30 seconds for some customers at some time during this period. E-mail will not be affected.

This maintenance is necessary to install a mandatory MySQL security update that will upgrade the MySQL version to 5.1.61. We apologize for any inconvenience this causes.

Update 10:13 PM: The maintenance was completed with less than 30 seconds downtime on each server. Customers should not notice any changes, but as always, don’t hesitate to contact us with any questions or problems.

Problem on web03 server (resolved)

Web sites on the web03 server suffered an interruption in service between 7:32 AM and 7:45 AM today (Tuesday, February 21).

This was caused by a “hung” process that prevented a routine Apache Web server reload from completing. Other servers were not affected. Our staff restarted the server to stop the “hung” process, and the problem was resolved.

We sincerely apologize to customers affected by this incident. We’re considering possible underlying causes to prevent a recurrence.

Brief scheduled maintenance February 18, 2012 (completed)

On Saturday, February 18, 2012 between 10:00 and 11:00 PM Pacific time, we’ll be upgrading the Apache Web server software on each of our Web servers.

Most customers will not notice anything, but the upgrade will cause approximately 30 seconds of slow Web page loading at some point during that hour as we delay incoming connections at the network level.

This maintenance is necessary to apply security and reliability fixes released by the Apache developers. (We’ve been using the upgraded version on our Webmail servers for several days, so it’s well tested.)

Update: The maintenance was completed at 10:03 PM Pacific time.

web05 server high load (resolved)

The disk load on the “web05” server was extremely high between 2:30 and 2:42 AM Pacific time Saturday February 4, causing some downtime during that period for sites using that server. Other servers were not affected.

Stability improvements for a server memory problem

A couple of days ago, one of our Web servers became unstable for an unknown reason and needed to be restarted. This is rare: on average, this happens less than once every five years of uptime per server, so we took it very seriously and launched an investigation.

What we found was that the owner of one of the sites on that server made a mistake that allowed attackers to run their own scripts. That’s all too common, unfortunately, but usually only the single site is affected by this kind of thing. What was surprising in this case was that the script used a previously unknown method of causing problems for other sites running on the server.

As a result of this investigation, we’ve made several changes to our systems to ensure the problem won’t recur. The rest of this post has a detailed technical description of the problem in case it’s useful for others.

web07 server restart on February 1, 2012 (resolved)

Our “web07” server needed restarting at 11:36 AM Pacific time on February 1, 2012, because it had been intermittently unable to run some PHP scripts for 22 minutes.

The restart resolved the immediate problem, and a follow-up post explains what happened and the changes we made to prevent it from happening again.

Comcast routing problems December 16, 2011 (resolved)

Update: The problems described below were resolved by Comcast around 11:00 AM Pacific time and have not recurred since. We’re cautiously marking this issue closed, but continuing to monitor it.

We’ve received scattered reports of high “packet loss” to a few Comcast locations (but not most). Packet loss can cause pages to load slowly in some cases.

Outage December 12, 2011 (resolved)

Between 5:34 PM and 6:10 PM Pacific time December 12, many customers experienced a complete outage of their sites (and of our own www.tigertech.net and mail.tigertech.net sites).

This was caused by the failure of a hardware Ethernet switch in one of our server cabinets, cutting off all access to the servers that plug into it. The Ethernet switch began working after being physically unplugged and plugged in again, but since we do not know why it failed, it will be completely replaced tonight as a result of this incident.

This is the same model of Ethernet switch that we’ve been using in all our cabinets for years, so we don’t believe it is a general problem with the hardware in question.

We sincerely apologize for this incident. We take reliability seriously, and we don’t consider an outage like this acceptable.

Update 1:20 AM: The failed Ethernet switch was replaced with no further downtime.