[Done]Emergency maintenance on Web39 Friday, 29 June

Posted in Downtime by

Web39 has been taken down for an emergency hard drive replacement. We’ll update this post as maintenance progresses.

2012-06-29 20:38 UTC: The server is now back online and the drive’s RAID is rebuilding. We’ll continue to monitor its progress.

2012-06-29 21:05 UTC: The RAID rebuild was not progressing as expected. We’ve rebooted the server to verify that the controller is detecting the new hard drive correctly.

2012-06-30 09:20 UTC: We’ve rebooted the server because the RAID rebuild had stalled again.

2012-06-30 10:10 UTC: The reboot did not resolve the RAID rebuild problem, and we are working with the onsite engineers to identify the exact cause.

2012-06-30 17:56 UTC: After we swapped the new hard drive for another drive, the server’s RAID rebuilt correctly.

-
-

[Done]Scheduled maintenance on Web200, July 2nd 2012.

Posted in Scheduled downtime by

Web200 will be taken down Monday July 2nd 2012 at 13:00 UTC for a RAID cache replacement. We will update this post as maintenance progresses.

2012-07-02 13:20 UTC: The RAID cache has been replaced. The server is back in operation.

-
-

[Done]Scheduled maintenance on Web114 June 29th 2012.

Posted in Scheduled downtime by

Web114 will be taken down for a RAID cache swap Friday June 29th 2012 at 14:00 UTC. We will update this post as maintenance progresses.

2012-06-29 14:27 UTC: The RAID cache has been swapped. The server is back in operation.

-
-

[Done]Emergency maintenance on Web300, June 27th 2012.

Posted in Downtime by

Web300 is currently undergoing a filesystem check (fsck) following an unscheduled reboot. We will update this post periodically with its status.

2012-06-27 09:36 UTC: The fsck is still running.

2012-06-27 11:35 UTC: The fsck has completed. The server is back in operation.

-
-

[Done]Scheduled Maintenance on Web95 June 27th 2012

Posted in Scheduled downtime by

Web95 will be taken down for a RAID cache module swap Wednesday June 27th 2012 at 14:00 UTC. We will update this post as maintenance progresses.

2012-06-27 14:32 UTC: The RAID cache has been swapped. The server is back in operation.

-
-

Scheduled Maintenance on Dweb29 June 27th 2012.

Posted in Scheduled downtime by

Dweb29 will be taken down for a disk drive replacement Wednesday June 27th between 07:00 UTC and 11:00 UTC. We will update this post as maintenance progresses.

-
-

[Done]Emergency maintenance on Web226

Posted in Downtime by

Web226 has been taken offline for emergency maintenance. The server was experiencing extremely high load, which our monitoring suggests may be related to an intermittent hardware failure (RAM or motherboard). We are investigating this now.

2012-06-22 03:09 UTC: The server is back online temporarily, but we will be taking it offline again for further investigation.

2012-06-21 05:34 UTC: The server has now been taken offline to replace the RAM.

2012-06-21 05:57 UTC: After replacing the failed RAM module, we found several other RAM modules that had not been reporting as failed before the replacement. We’re now working to take the server offline again to replace all of the RAM modules.

2012-06-22 06:08 UTC: The server is now back online and functioning normally.

-
-

[Fixed] Web226 down

Posted in Downtime by

Web226 stopped responding several minutes ago. We are currently working to restore service and will update this post when we have more information.

2012-06-20 21:22 UTC: Web226 is back online, and we are currently troubleshooting a recurring load spike.

2012-06-21 03:09 UTC: Web226 has been stable since we brought it back online.

-
-

[Done]Emergency maintenance on Web99

Posted in Downtime by

Web99 is being taken offline for emergency maintenance (disk cache replacement). We’ll update this blog post with more information as the maintenance progresses.

2012-06-14 19:22 UTC: We’ve now taken the server offline and the disk cache is being replaced.

2012-06-14 19:44 UTC: The server is now back online. We’re verifying that the cache replacement has been successful.

2012-06-14 21:07 UTC: The server is still online and all of our tests have passed.

-
-

[Done]Emergency maintenance on Web65 Thursday, 14 June

Posted in Downtime by

Web65 has been taken offline for emergency maintenance. The server was experiencing extremely high load and was not using any swap space, even though the OOM killer was running. We’re now investigating possible hardware problems.

2012-06-13 02:34 UTC: The server is back online, but our testing and investigation aren’t finished, so it may go offline several more times before the maintenance is complete.

2012-06-13 03:06 UTC: We’ve taken the server offline again to replace all of its RAM, to rule out RAM as the cause.

2012-06-13 03:43 UTC: The server is back online, and we’re monitoring it closely to verify that the RAM swap has fixed the problems we were seeing.

2012-06-13 04:00 UTC: We’ll continue to monitor the server closely throughout the night, but the hardware problems we found appear to have been fixed by the RAM swap.

2012-06-13 04:32 UTC: The problems seem to have returned, and we have had to reboot the server. We are still monitoring it to determine the exact cause.

2012-06-14 12:49 UTC: The problem was with one of our backup subsystems. We have now corrected it, and the server is stable.

-
-