On Tuesday 17th Feb 2009 at 9am CST (3pm GMT) we will be performing essential network maintenance, replacing one of our network switches. All servers connected to this switch will briefly lose network connectivity; downtime should be minimal, no more than a few minutes. The servers affected by this outage are:
POP, IMAP and Webmail services hosted on mail.webfaction.com (but not outbound email)
We’ve been experiencing intermittent high loads on our mail platform, which have made some mail services slow.
In the past few hours we have added more mail servers to our platform. We are currently working on spreading the load across these new servers, so the speed of mail services should improve soon.
Update (Feb 9th, 10.30pm GMT): We have started spreading mailboxes across our new mail servers, but the load on the platform remains high at the moment. We are continuing to move mailboxes to the new servers.
Update (Feb 10th, 5pm GMT): The load on the mail platform remains high. We are still moving mail accounts to the new mail servers, but the process is taking a while because the servers we are moving them from are overloaded. We also have additional mail servers on the way to make sure that these load problems won’t recur once we have the current load under control.
Update (Feb 11th, 4.30pm GMT): We have now migrated a good number of accounts to our new mail servers and the load on the platform is much lower. We will continue to migrate more accounts.
Update (Feb 23rd, 6.40pm GMT): The load across our mail platform has now been low for over a week and all services have been responding quickly during that time so we’re marking this issue as resolved.
We’ve had some issues on these servers since yesterday. A misconfiguration (two characters, to be exact) in our memory watchdog script caused it to kill some root processes it shouldn’t have. As a result we saw SSH, DNS and database issues on these servers: those services were dying regularly and being restarted later.
Fortunately we’ve been able to track down the problem, and everything should be back to normal now.
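To give a sense of how a two-character mistake can have that effect, here is a hypothetical sketch of a memory watchdog (this is not our actual script, and the threshold and commands are assumptions for illustration only). Dropping the two characters "-v" from the grep below would invert the owner filter, so the script would target exactly the root-owned processes it was meant to leave alone:

```shell
#!/bin/sh
# Hypothetical memory watchdog sketch: find processes whose resident
# memory exceeds a threshold and flag them for killing, skipping any
# process owned by root. Losing the two characters "-v" in the grep
# would flip the filter and select only root processes instead.

THRESHOLD_KB=500000  # assumed threshold, not a real production value

ps -eo user,pid,rss --no-headers |
  awk -v max="$THRESHOLD_KB" '$3 > max { print $1, $2 }' |
  grep -v '^root ' |                 # "-v" excludes root-owned processes
  while read user pid; do
    # A real watchdog would run: kill "$pid"
    echo "would kill $pid (owner: $user)"
  done
```

The same filtering logic can be checked on synthetic `ps`-style input, which is a safer way to test a script like this than letting it loose on live processes.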