[Fixed] Scheduled Maintenance on Web119, 26 July 2012.

Posted in Downtime by

At 13:00 UTC on 26 July 2012 we’ll be taking the server offline for scheduled maintenance. We don’t expect the downtime to exceed 1 hour. We’ll update this post as the maintenance progresses.

2012-07-26 14:05 UTC: Web119 is back up and functioning properly.


[Fixed] Emergency maintenance on Web119, July 25 2012

Posted in Downtime by

Web119’s file system went read-only. We are currently running fsck to bring the server back up.

2012-07-25 14:40 UTC: Web119 is back up and functioning properly.


[Done] Emergency maintenance on Web106, July 25 2012

Posted in Downtime by

Web106’s file system went read-only. We are currently running fsck to bring the server back up.

2012-07-25 15:39 UTC: Web106 is back up and functioning properly.

2012-07-25 16:58 UTC: The filesystem has gone read-only again. We’re working to resolve the issue.

2012-07-25 18:21 UTC: The filesystem check is complete, and we’re running other hardware diagnostic tests at this time.

2012-07-25 20:33 UTC: The filesystem continues to go into a read-only state, even after successful checks. Since the disks seem to be OK, we’re arranging for a full chassis swap at this time.

2012-07-25 22:34 UTC: The chassis has been swapped. The filesystem still contains errors after the chassis swap; we’re running a filesystem check.

2012-07-26 00:04 UTC: The filesystem check is still in progress, 45% complete.

2012-07-26 01:20 UTC: The filesystem check is still in progress, 80.5% complete.

2012-07-26 03:25 UTC: The file system check is finished and the file system seems to be stable. We’re now working to bring the server back up on the network.

2012-07-26 04:26 UTC: The server is now back online. We’ll continue to monitor the server closely to make sure that no additional filesystem, hardware, or network errors are left unresolved.

2012-07-26 07:13 UTC: The server is down again; we suspect that a failing drive in the RAID array is causing the read-only condition. We need to run fsck, back up the data, and replace that drive.

2012-07-26 10:14:41 UTC: There were problems getting the server into rescue mode, but the fsck has now started.

2012-07-26 13:48 UTC: The fsck has completed, the RAID firmware has been upgraded, and the array is now rebuilding. The server is back to operational status. We will keep monitoring this machine closely.


[Done] Emergency maintenance on Web238, July 22 2012

Posted in Downtime by

Web238 is currently down for emergency maintenance. We’ll continue to update this post with additional information as it is available.

2012-07-22 20:40:34 UTC: The file system went read-only, so we are running fsck to prevent any data corruption.

2012-07-22 22:19 UTC: The file system could not be recovered; it had numerous corruptions that could not be repaired. We’re now working to recover the data from the /home directory and any other data that is not corrupt.

2012-07-22 23:47 UTC: We are still working to recover all available data from the corrupted filesystem. We’ve been able to recover almost all of the data stored in the /home directory.

2012-07-23 00:45 UTC: The data recovery has finished. We’re now re-installing the operating system.

2012-07-23 01:45 UTC: The operating system has been re-installed and we are now installing our platform and tools.

2012-07-23 02:46 UTC: We’re now syncing the saved data back to the machine.

2012-07-23 04:19 UTC: The sync of user data back to the machine has completed. We are now working on restoring the MySQL and PostgreSQL databases.

2012-07-23 05:27 UTC: We are still restoring the MySQL and PostgreSQL databases.

2012-07-23 05:55 UTC: The databases have been restored and the server is now back online and working normally. We’ll continue to monitor the server to verify that there are no further problems.


[Fixed] Web164 is offline

Posted in Downtime by

Web164 is currently offline. We are investigating the issue.

2012-07-19 09:13 UTC: Web164 is up and running again.

2012-07-19 09:30:13 UTC: The file system had gone read-only, so we are taking the server down again for an fsck to prevent any data corruption.

2012-07-19 09:41:45 UTC: The fsck is running and the first pass is at 15%.

2012-07-19 09:54:14 UTC: The first pass is at 65%.

2012-07-19 10:16:33 UTC: The first pass is over and the second pass is at 80%.

2012-07-19 10:31:50 UTC: The fsck is over and the server has been rebooted; it is OK now.


[Fixed] Network outage affecting multiple servers

Posted in Downtime by

A network outage is currently affecting the following servers: dweb74, dweb76, dweb77, dweb78, mailbox7.

We’re working with our upstream provider to resolve this issue and hope to have service restored ASAP.

We’ll update this post when we have more information.

2012-07-18 21:11 UTC: Network connectivity has been restored.


[Fixed] Web216 is down

Posted in Downtime by

Web216 is currently down. We are investigating.

2012-07-18 19:33 UTC: Web216 is back online. There was a power failure that brought the server down.

 


[Done] Degraded RAID on Web176 causing elevated load

Posted in Scheduled downtime by

The disk array on Web176 is currently rebuilding to correct a degraded state. This operation will take several hours, during which system load will be elevated and server performance will be slower.

We’ll update this post when the rebuild is complete.

2012-07-19: The server will be taken offline for a disk drive replacement on Friday, July 20th, between 07:00 UTC and 10:00 UTC. We will update this post as maintenance progresses.

2012-07-20 07:35 UTC: The disk has been replaced and the array is currently rebuilding. The server is back to operational status.


[Fixed] Web176 Offline

Posted in Downtime by

Web176 is currently offline. We are investigating the issue.

2012-07-17 18:37 UTC: There was a hardware failure; we are working to complete the repair as quickly as possible.

2012-07-17 19:11 UTC: The hardware issue has been resolved and the server is back online.


[Done] RAID firmware upgrade on Web237 on July 16th at 2pm UTC

Posted in Downtime by

Web237 will be taken offline on July 16th at 2pm UTC for a RAID firmware upgrade. The downtime should be less than 1 hour.

[Update 14:19 UTC] The upgrade took only 10 minutes, and the server is back online.
