Web29 will be taken offline for a scheduled hard drive replacement on Wednesday, August 24th at 22:00 UTC. The downtime should not be longer than 1 hour. We will update this post as maintenance progresses.
Update – [04:19 UTC 2011-08-25]: The server’s drive has been replaced. The server has been back online for approximately 3 hours and is functioning normally.
Web139 will be taken offline for a scheduled hard drive replacement on Wednesday, August 24th at 06:00 UTC. The downtime should not be longer than 1 hour. We will update this post as maintenance progresses.
[07:03 UTC] The drive has been replaced and the server is back to normal operations.
We are aware of an issue affecting Web90, Web141, Web162 & Web166 and are working to resolve it ASAP. We will post more information here as it develops.
Update [2011-08-20 09:25 UTC] – Here is the post-mortem and follow-up for this issue:
Over the last 36 hours Web90, Web141, Web162, and Web166 have repeatedly gone offline or become unreachable.
Since the first outage, we have been working with the data center to determine the exact cause of the issue. Because high network usage was cited as the cause of the outage, we watched the servers carefully for the remainder of the day.
When the servers became unstable again, we began scrutinizing their outbound traffic, using both the data center’s monitoring and our own monitoring tools on the servers. The initial cause appeared to be UDP packets flooding the connection.
Blocking specific UDP packets produced no change, so we began to look deeper into the cause. After a few hours, both the data center and our system administrators found that the cause was fragmented IP packets flooding the outbound connection on the servers.
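As an illustration of what "fragmented" means at the packet level, here is a minimal sketch that inspects the flags and fragment-offset field of a raw IPv4 header. It is not the tooling we or the data center used; the function names and the `build_header` helper are hypothetical and exist only for this example. An IPv4 packet is a fragment when its "More Fragments" bit is set or its fragment offset is non-zero.

```python
import struct

MF_FLAG = 0x2000      # "More Fragments" bit in the flags/offset field
OFFSET_MASK = 0x1FFF  # lower 13 bits hold the fragment offset


def is_ipv4_fragment(packet: bytes) -> bool:
    """Return True if the raw IPv4 packet is a fragment (first, middle, or last)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header needs at least 20 bytes")
    if packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    # Bytes 6-7 of the header: 3 flag bits followed by a 13-bit fragment offset.
    (flags_and_offset,) = struct.unpack_from("!H", packet, 6)
    more_fragments = bool(flags_and_offset & MF_FLAG)
    offset = flags_and_offset & OFFSET_MASK
    return more_fragments or offset != 0


def build_header(flags_and_offset: int) -> bytes:
    """Hypothetical demo helper: a bare 20-byte IPv4 header (checksum left at zero)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,              # version 4, header length 5 words
        0,                 # type of service
        20,                # total length (header only)
        0,                 # identification
        flags_and_offset,  # flags + fragment offset
        64,                # TTL
        17,                # protocol (UDP)
        0,                 # checksum (not computed in this sketch)
        bytes(4),          # source address 0.0.0.0
        bytes(4),          # destination address 0.0.0.0
    )


if __name__ == "__main__":
    print(is_ipv4_fragment(build_header(0x2000)))  # first fragment: MF set
    print(is_ipv4_fragment(build_header(0x4000)))  # unfragmented: only DF set
```

A non-first fragment carries no UDP or TCP header of its own, which is one plausible reason that tools keyed on ports or protocols can miss this kind of traffic.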
These fragmented IP packets were not being picked up through normal monitoring channels because the monitoring software did not consider them valid packets. With the issue found, we traced it back to its root cause, which was the WordPress exploit we tweeted about earlier:
Of all the WordPress sites we host, the only ones hit by this thumb.php exploit to this extent were on Web166 and Web90. Because these machines are in the same racks as our other servers, the excess bandwidth saturated the connection and caused outages across the entire network segment.
We immediately began identifying vulnerable WordPress themes and plugins that use the thumb.php and timthumb.php files, and messaging the owners of the affected sites with details of the issue and a fix.
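For customers who want to check their own sites, the following is a minimal sketch of a filename scan, not the process our administrators ran. The function name and the example root directory are assumptions for illustration. It only flags files named thumb.php or timthumb.php; a match is a candidate for review and does not by itself prove the copy is a vulnerable version.

```python
import os
from typing import Iterator

# File names associated with the exploit described in this post.
CANDIDATE_NAMES = {"thumb.php", "timthumb.php"}


def find_candidate_scripts(root: str) -> Iterator[str]:
    """Yield paths under `root` whose file name matches a candidate script name."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so renamed copies like Thumb.PHP are caught.
            if name.lower() in CANDIDATE_NAMES:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # "public_html" is a placeholder; point this at your own web root.
    for path in find_candidate_scripts("public_html"):
        print(path)
```

Each file it reports should then be updated to a patched release or removed if the theme or plugin no longer needs it.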
Since we began that process, the servers have been online, and we are monitoring them very closely to ensure that no other vulnerable WordPress sites can be exploited. Server security was never compromised: because of the way our users and ACLs are set up, the exploits ran as the individual user, as all PHP scripts do.
Over the next few days we will be looking for these specific vulnerabilities across all servers and notifying the affected users.
Web25 will be taken offline for scheduled RAID card maintenance on Sunday, August 21st at 18:00 UTC. The downtime should not be longer than 4 hours. We will update this post as maintenance progresses.
2011-08-21 18:27 UTC: Web25 is down for maintenance.
2011-08-21 18:52 UTC: Maintenance has been completed and Web25 is back online.
We are aware of an issue on Web131, Web90, Web108, Web166, Web141, & Web162 and are working to resolve it ASAP. We will post more information here as it develops.
[Update 01:00 UTC] Web servers Web131, Web166, Web162 and Web108 are back online. Web90 is running an fsck and we are working to bring Web141 online.
[Update 01:15 UTC] Web servers Web141 and Web90 are back online.
[Done 01:20 UTC] All web servers appear to be back online. If anyone has any remaining issues, please let us know in a support ticket. The issues were caused by internal network latency.
[Update 04:37 UTC] The same servers are offline again. We are working to bring them back online now and are investigating further. We will update the status blog as soon as we have more information.
[Done 06:52 UTC] The servers are back online and we have been monitoring them in real time. The issue appears to have been a rapid spike in network usage that brought the systems down without warning. We are now tracking down the root causes.
Web25 will be taken offline for a scheduled hard drive replacement on Thursday, August 18th at 12:00 UTC. The downtime should not be longer than 1 hour. We will update this post as maintenance progresses.
Update [2011-08-18 13:20 UTC]: The server is now back online and responding to requests.
The /home2 filesystem on Web7 has gone into a read-only state. We’re going to take that machine down momentarily to troubleshoot and repair the problem.
Update [2011-08-15 08:11 p.m. UTC]: We are performing a filesystem verification on the server. We will update this post as it progresses.
Update [2011-08-15 08:37 p.m. UTC]: The server is now back online and responding to requests.