Web69 is currently down. Its root partition went read-only, and rebooting it revealed an issue which we are now working to resolve. We will post updates as they become available.
The problem appears to be with the RAID controller; we are replacing the hardware and restoring all data from backup.
2009-03-09 06:00 PST: Web69 is still down, having suffered a serious RAID controller failure. We have recovered all of the data from the server and are currently restoring it to a new standby server, which will replace Web69.
2009-03-09 06:16 PST: Web69 is now back online with all of its data. We decided to move the data onto a new server to give us more time to check the hardware on the failing machine. We copied all of the data from just before the crash, so no data has been lost.
We will be investigating the cause of an audible alarm on Web42 tomorrow (Feb 25th) at 9am GMT.
Depending on the severity, we may have to take the server down; if so, the outage could last anywhere from a few minutes to a few hours. We will update this ticket as soon as we have more information tomorrow.
Update: the problem was a degraded RAID. We replaced one of the drives and the server is now back online and the RAID is rebuilding.
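The updates above mention a degraded RAID as seen by the controller; the exact check we use is not shown. As an illustration only: on a Linux server with *software* RAID, a degraded array shows up in `/proc/mdstat` as an underscore in the slot map (e.g. `[U_]` instead of `[UU]`). Here is a minimal Python sketch, using made-up sample data, that flags such arrays:

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose /proc/mdstat status line shows a
    failed member, i.e. an underscore in the [UU...] slot map."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # A slot map like [U_] means one member of the array is missing.
        if current and re.search(r"\[U*_+U*\]", line):
            degraded.append(current)
    return degraded

# Hypothetical /proc/mdstat contents: md0 is healthy, md1 is degraded.
sample = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      1048512 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      2096384 blocks [2/1] [U_]
unused devices: <none>
"""
print(degraded_arrays(sample))  # → ['md1']
```

In practice the rebuild itself would be handled with `mdadm` (or, for a hardware controller like the one in this incident, the vendor's own tools); this sketch only covers the detection side.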
Mail services on mail5.webfaction.com and webmail.webfaction.com are currently not working. We are looking into the problem and hope to have normal service restored soon. We will update this entry as we have more information.
2009-02-20 13:46 CST The Mail5/Webmail server has a disk problem. Repairs are now in progress.
2009-02-20 14:17 Repairs on Mail5 are still in progress. We have pointed ‘webmail.webfaction.com’ to a different mail server, so as soon as that DNS change propagates, you’ll be able to access the webmail system (unless your mailbox resides on mail5). Your existing webmail address book and preferences will not be available, since they are stored on the server that is currently having problems.
2009-02-20 14:48 Mail5 is back online. The webmail system is still running on the alternate server.
2009-02-20 15:04 webmail.webfaction.com is pointing at the original server, so webmail users should now have access to their address books and preferences.
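After a repoint like the one above, each client sees the new address only once its resolver's cached record expires. If you want to check what your own resolver currently returns for a hostname, a short Python check is enough (the hostname used below is just a stand-in for one like webmail.webfaction.com):

```python
import socket

def resolved_address(hostname):
    """Ask the local resolver which IPv4 address a hostname currently
    maps to; handy for checking whether a DNS change has reached you."""
    return socket.gethostbyname(hostname)

# Compare the result against the address the status update says is current.
print(resolved_address("localhost"))  # typically 127.0.0.1
```

If the address returned is still the old one, your resolver is serving a cached record and you simply need to wait for its TTL to expire.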
On Tuesday 17th Feb 2009 at 9am CST (3pm GMT) we will be performing some essential network maintenance, which will involve replacing one of our network switches. All of the servers connected to this switch will briefly lose network connectivity. Downtime should be minimal: no more than a few minutes. The servers affected by this outage are:
POP, IMAP, and Webmail services hosted on mail.webfaction.com (but not outbound email)
We’ve been experiencing intermittent high load on our mail platform, which has made some mail services slow.
In the past few hours, we have added more mail servers to our platform. We are currently spreading the load across these new servers, so mail services should speed up soon.
Update (Feb 9th, 10.30pm GMT): We have started spreading mailboxes across our new mail servers, but the load on the platform remains high for now. We are continuing to move mailboxes to the new servers.
Update (Feb 10th, 5pm GMT): The load on the mail platform remains high. We are still moving mail accounts to the new servers, but the process is taking a while because the servers we are moving them from are overloaded. We also have more mail servers on the way, to make sure that these load problems won’t recur once we have the load under control.
Update (Feb 11th, 4.30pm GMT): We have now migrated a good number of accounts to our new mail servers, and the load on the platform is much lower. We will continue to migrate more accounts.
Update (Feb 23rd, 6.40pm GMT): The load across our mail platform has now been low for over a week, and all services have been responding quickly during that time, so we’re marking this issue as resolved.
We’ve had some issues on these servers since yesterday. A misconfiguration (two characters, to be exact) in our memory watchdog script caused it to kill some root processes it shouldn’t have. As a result, we had intermittent SSH, DNS, and database issues on these servers: those services were dying regularly and being restarted later.
Fortunately we’ve been able to track down the problem and everything should be back to normal now.
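The two-character fix itself isn't shown above, but the guard such a watchdog needs is easy to illustrate. Here is a hedged Python sketch (function name, parameters, and limits are all hypothetical, not our actual script) of the core check: a process is only killed if it exceeds its memory allowance *and* is not owned by root, which is exactly the condition the misconfiguration broke:

```python
import os
import signal

def maybe_kill(pid, rss_bytes, limit_bytes, uid):
    """Hypothetical watchdog check: signal a process only if it is over
    its memory limit AND is not owned by root (uid 0). A watchdog
    missing the uid guard can take down sshd, named, or a database."""
    if uid == 0:
        return False          # never touch root-owned services
    if rss_bytes <= limit_bytes:
        return False          # within its memory allowance
    os.kill(pid, signal.SIGKILL)
    return True

# An over-limit root-owned process is left alone thanks to the uid guard.
print(maybe_kill(os.getpid(), 2 * 10**9, 10**9, uid=0))  # → False
```

The design point is simply that the ownership check must come before any kill decision; an inverted or missing comparison there is enough to produce exactly the cascade of dying SSH, DNS, and database services described above.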