Web194 currently has a failing disk. We are scheduling a replacement now and will take the machine down shortly for repair. We will update this post when we have more information.

Update [2011-07-11 23:23 UTC] – The server’s file system has entered a read-only state. We will update this post when we have more information.

Update [2011-07-11 23:46 UTC] – We are bringing the server down to verify the file system so that the read-only state can be cleared. We will keep this post updated.

Update [2011-07-12 01:26 UTC] – The FSCK is currently still running.

Update [2011-07-12 02:15 UTC] – The FSCK is still running.

Update [2011-07-12 05:27 UTC] – The FSCK is still running.

Update [2011-07-12 07:05 UTC] – The FSCK is still running.

Update [2011-07-12 11:00 UTC] – Unfortunately, we had to reboot the server into a rescue environment and restart the FSCK from there. FSCK is currently at 25%.

Update [2011-07-12 15:11 UTC] – FSCK is still running. It was very slow because the RAID array was being rebuilt on the machine at the same time. The RAID array has now finished rebuilding, so FSCK should get much faster.

Update [2011-07-12 17:11 UTC] – Unfortunately FSCK was still slow so we have decided to re-install the machine and restore all the data from backup.

Update [2011-07-12 19:11 UTC] – The operating system has been re-installed and we are in the process of setting the server up now.

Update [2011-07-12 20:36 UTC] – The server has been set up completely. We are now starting to restore customer data to the machine.

Update [2011-07-12 22:27 UTC] – We have restored all databases on the server. We are still restoring customer data to the server.

Update [2011-07-12 23:45 UTC] – We are still restoring customer data to the server.

Update [2011-07-13 01:25 UTC] – We are still restoring customer data to the server.

Update [2011-07-13 04:01 UTC] – We are still restoring customer data to the server. The first pass has finished and we are now verifying the integrity of the files.

Update [2011-07-13 05:25 UTC] – We are still restoring customer data to the server. The second pass has finished and we are now verifying the integrity of those files.

Update [2011-07-13 07:09 UTC] – Most users’ sites are now online.

Update [2011-07-13 07:28 UTC] – User logins are enabled and working.

Update [2011-07-13 08:42 UTC] – The server is now back to normal. In 8 years of business, this is the first time we have had such a long downtime on a server, and we would like to apologize for that. The problem was a combination of a corrupted filesystem, a degraded RAID array and FSCK taking many times longer than usual. We will update our procedures to greatly reduce the downtime if this happens again: we’ll run FSCK before trying to rebuild the array (rebuilding the array can be done once the server is back online), and if FSCK is taking too long we’ll stop it much sooner and start re-installing the server and restoring the data from backups straight away.
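
For readers who want the revised procedure spelled out, here is a rough sketch of the decision order described above. It is only an illustration, not our actual tooling: the function name, the step names and the 4-hour cutoff are all assumptions made for the example, since we have not fixed an exact time limit here.

```python
from datetime import timedelta

# Hypothetical cutoff; the post does not state an actual time limit.
FSCK_TIME_LIMIT = timedelta(hours=4)


def next_recovery_step(fsck_done: bool, fsck_elapsed: timedelta,
                       server_online: bool) -> str:
    """Return the next action under the revised procedure (illustrative only)."""
    if not fsck_done:
        if fsck_elapsed > FSCK_TIME_LIMIT:
            # FSCK is taking too long: stop it, re-install and restore from backups.
            return "reinstall_and_restore_from_backup"
        # Keep checking the file system; the array rebuild has not started yet.
        return "continue_fsck"
    if not server_online:
        # FSCK finished: get the server serving again before anything else.
        return "bring_server_online"
    # The array rebuild is deferred until the server is back online.
    return "rebuild_raid_array"
```

The point of the ordering is that the RAID rebuild never competes with FSCK for disk I/O, which is what slowed the check down during this incident.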