Archives for the 'Downtime' category
[Resolved] Networking Problem
Updated Sep 2 at 10:20 CDT (first posted Sep 1 at 13:55 CDT) by David Sissitka in Downtime - 0 comment(s)
The Planet appears to be having a networking problem that's making some servers inaccessible to some people. We've contacted The Planet re the problem and we'll post updates as we receive them.
Update (02:07 PM CDT): The Planet is currently working on their DNS servers to resolve an issue with intermittent nameserver resolution. They didn't have an ETA as to when this would be done for us.
Read the full article and comments
Web4 rebooting (Fixed)
Posted Aug 6 at 08:20 CDT by Sean in Downtime, Problems - 0 comment(s)
The filesystem on Web4 went read-only a few moments ago. We are rebooting that server now and hope to have normal service restored soon.
Update 2008-08-06 08:33: Web4 is back online.
Help system and Mail3 down (fixed)
Posted Aug 4 at 11:56 CDT by Sean in Downtime, Problems - 0 comment(s)
The help system and mail3 are currently down. We're currently rebooting that server and hope to have normal service restored soon.
Update: 2008-08-04 12:01 - Mail3 is back online.
Update: 2008-08-04 15:22 - IMAP and SSH stopped responding on mail3, so we had to reboot it once more.
Web35 Down (Fixed)
Posted Aug 2 at 20:52 CDT by Richard in Downtime, Problems - 0 comment(s)
The filesystem on Web35 started having problems a few minutes ago. We are rebooting Web35 now, and will perform a filesystem check before bringing the server back online. We'll update this entry when we have new information.
Update 2008-08-02 21:01: The server is back up now.
Web13 down (fixed)
Posted Jul 28 at 04:44 CDT by Remi in Downtime - 0 comment(s)
Web13 is currently down for an unknown reason. We'll update this post once we have more information.
Update: Web13 is now back online. The problem was caused by a misconfiguration of the network interface on the machine.
Network issues at ThePlanet affecting multiple servers (fixed)
Updated Jun 11 at 05:48 CDT (first posted Jun 6 at 10:00 CDT) by Sean in Downtime - 0 comment(s)
An apparent outage at ThePlanet's H1 data center is currently affecting multiple WebFaction servers. We will update this entry when we have more information.
From ThePlanet: June 6 10:00am CDT We have lost network connectivity to H1. We are confirming the extent of any power loss, and we will be updating shortly.
From ThePlanet: June 6 10:05am CDT Transport for H1 temporarily fell offline and is restored. H1 Phase 2 did not lose power. H1 Phase 1 lost power. We will be updating again shortly.
From ThePlanet: June 6 10:10am CDT The temporary generator powering Phase 1 failed. We switched over to the backup generators that were just brought in. The CRAC units have been powered on, and PDUs are having power restored right now.
From ThePlanet: June 6 10:15am CDT We continue to power PDUs in Phase 1. Customer servers should be coming back online shortly.
Normal service has been restored.
Web35 rebooting (fixed)
Updated Jun 11 at 05:50 CDT (first posted Jun 5 at 15:41 CDT) by Sean in Downtime - 0 comment(s)
The main filesystem on Web35 went read-only a few minutes ago. We are rebooting Web35 now, and will perform a filesystem check before bringing the server back online. We'll update this entry as the situation develops.
Update @ 18:05 CST: Filesystem check on Web35 is complete, and Web35 is back online.
Datacentre issue updates for June 4 (fixed)
Updated Jun 11 at 05:47 CDT (first posted Jun 4 at 09:00 CDT) by Sean in Downtime - 0 comment(s)
Update (June 4, 09:00 CDT): Recovery work on Web27 is ongoing - we hope to have it completed within the next 12 hours. Web35 is currently down for a reboot and disk diagnostics. We hope to have service to Web35 restored soon.
Update (June 4, 10:44 CDT): Recovery work on Web27 is complete, and most customer sites on Web27 are back online. The server IP address for Web27 has changed - the new address is 70.84.101.162. Customers using third-party DNS servers (eg, not ns*.webfaction.com) will need to update their DNS info to point to the new IP.
Web35 is still down - we hope to have service to Web35 restored soon.
Update (June 4, 12:57 CDT): Web35 is now back online.
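For customers on third-party DNS, the Web27 address change noted above can be checked with a short script. This is only a rough sketch: the function names are made up for illustration, the domains to check are supplied by you on the command line, and the result reflects whatever your local resolver has cached, so a recent DNS change may take a while to show up.

```python
import socket
import sys

# New Web27 address, taken from the June 4, 10:44 CDT update above.
NEW_WEB27_IP = "70.84.101.162"


def resolve_ipv4(hostname):
    """Return the IPv4 addresses the local resolver reports for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def points_to_new_ip(addresses, expected=NEW_WEB27_IP):
    """Return True if the expected address is among the resolved addresses."""
    return expected in addresses


if __name__ == "__main__":
    # Usage (hypothetical file name): python check_web27_dns.py example.com
    for domain in sys.argv[1:]:
        try:
            found = resolve_ipv4(domain)
        except OSError as exc:  # covers socket.gaierror and socket.herror
            print(f"{domain}: lookup failed ({exc})")
            continue
        status = "OK" if points_to_new_ip(found) else "still needs updating"
        print(f"{domain}: {', '.join(found)} -> {status}")
```

Domains that use ns*.webfaction.com should not need any change, per the update above.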
Datacentre issue update (fixed)
Updated Jun 11 at 05:47 CDT (first posted Jun 2 at 10:03 CDT) by Remi in Downtime - 0 comment(s)
Here is an update to the previous post: the power has been restored in parts of
the datacenter and most servers and services have been restored (including our
control panel and support system).
The following servers are in another part of the datacenter which will only
have power later (there is no exact ETA but it probably won't be back today):
Krait, Mail2, Web6, Web27, Web28 and Web29
These servers are currently being moved to another datacenter but there is
no ETA either on when it will be completed.
In the meantime we can provide free plans on other servers to anyone
who's currently on Krait, Web6, Web27, Web28 and Web29. If you would like
one of these accounts just open a ticket and let us know.
Also, we have set up a backup mail server to receive e-mails sent to
Mail2 and store them until Mail2 is back up.
Update: All servers are now back up except Web27. We're still trying to get Web27 back up but there is no ETA on when this will happen (of course, you can still get a free plan on another server if you're on Web27).
Update: Due to an issue with one of the backup generators the following servers lost their connectivity: Krait, Mail2, Web6, Web27, Web28 and Web29. The datacentre team is working hard to restore connectivity.
Update (June 3, 09:40 CDT): The datacentre is currently testing a new backup generator. Hopefully connectivity will be restored soon.
Update (June 3, 12:09 CDT): Unfortunately using the backup generator didn't work and another backup generator is currently being delivered to the datacentre.
Update (June 3, 15:54 CDT): The backup generator has arrived at the datacentre and is now being filled with fuel and tested.
Update (June 3, 19:15 CDT): The backup generator is now working and all servers except Web27 have come back up. Web27 isn't coming back up and we're currently investigating what the problem is.
Update (June 3, 20:10 CDT): Web27 is going into a kernel panic but the RAID array doesn't report any faulty drive. At this point we're pursuing two options in parallel: try to fix Web27 and restore people's data on another server from backups, whichever will be done the quickest.
Network outage affecting several servers (fixed)
Updated Jun 11 at 05:48 CDT (first posted May 31 at 18:22 CDT) by Richard in Downtime - 0 comment(s)
Issues at one of our data centers are causing downtime on several servers. We are working to resolve this problem as soon as possible.
Update: Today at approximately 5:45 p.m. CDT, a transformer in one of The Planet's Houston datacentres caught fire, requiring them to take down all of the generators on site on the instructions of the fire department. This is one of six datacentres used by WebFaction. All servers hosted at that datacentre are currently offline.
Update: No servers in the datacentre have been damaged. However, they are still down because power is still out.
Update: The datacentre staff are still working to restore power to all affected servers.
Update: A few minutes ago The Planet posted some more information about the outage. Here are a couple of excerpts from the post:
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
Update: The datacentre staff are still working to fix the various network and power issues caused by the fire. There is currently no firm estimate for when everything will be back online.
Update: The datacentre staff are still working to restore power to the datacentre. Here's what they have to say:
We expect to be able to provide initial power to parts of the H1 data center beginning at 5:00 p.m. CDT. At that time, we will begin testing and validating network and power systems, turning on air-conditioning systems and monitoring environmental conditions. We expect this testing to last approximately four hours.
Following this testing, we will begin to power-on customer servers in phases. These are approximate times, and as we know more, we will keep you apprised of the situation.
Meanwhile, we are still working to get our customers' services up and running in a different data centre. It's currently unclear which is going to be the quicker fix (restoring power to the data centre or moving all our services) which is why we're pursuing both options.
Update (Jun 2, 8:40 am UTC): The second data center floor has been cooled down and restoration of power is in progress. Most of our servers are located on the second floor.
We've got a full staff in the data center to power up racks in sections and verify that the server hardware starts up successfully. This process may take a few hours to restore service to all customer servers on the second floor.
Update (Jun 2, 9:30 am UTC): Here's a list of the affected servers:
- mail2.webfaction.com
- krait.webfaction.com
- web5.webfaction.com
- web6.webfaction.com
- web7.webfaction.com
- web27.webfaction.com
- web28.webfaction.com
- web29.webfaction.com
- web30.webfaction.com
- web31.webfaction.com
- web33.webfaction.com
- web34.webfaction.com
- web35.webfaction.com
- web37.webfaction.com
- web40.webfaction.com
- web41.webfaction.com
- dweb1.webfaction.com
- dweb23.webfaction.com
- dweb26.webfaction.com
- dweb27.webfaction.com
Of those, all except mail2, krait, web5, web6, web7, web28, web29 and dweb1 are on the second floor of the datacentre.
The servers on the second floor are currently being powered up in batches. They should all be up and running within the next few hours.
The servers on the first floor are unlikely to receive power today. We're working to move all of the services and sites hosted on them to backup servers.
Update (Jun 2, 11:20 am UTC): The following servers are now up and running:
- web5.webfaction.com
- web7.webfaction.com
- web30.webfaction.com
- web31.webfaction.com
- web33.webfaction.com
- web34.webfaction.com
- web35.webfaction.com
- web37.webfaction.com
- dweb1.webfaction.com
- dweb23.webfaction.com
- dweb26.webfaction.com
- dweb27.webfaction.com
Mail2 is down (fixed)
Posted May 22 at 10:50 CDT by Sean Fulmer in Downtime, Problems - 0 comment(s)
Mail2 is currently down - we have a pending reboot request in at our data center and service should be restored shortly.
Update: mail2 is back online.
Web21 re-booted (fixed)
Posted May 12 at 08:45 CDT by Sean Fulmer in Downtime - 0 comment(s)
Apache and SSH stopped responding on Web21 this morning, so we had to reboot it.
The server is back online now. We're still investigating the root cause of the issue.
Web3 down (fixed)
Posted Apr 24 at 09:19 CDT by Remi in Downtime - 0 comment(s)
We're fixing a problem with sshd on Web3 and the server is currently down. It should be back up in a few minutes.
Update: Web3 is now back online.
Web27 rebooting (fixed)
Posted Apr 19 at 14:36 CDT by Remi in Downtime - 0 comment(s)
One of the drives in Web27's RAID died and we had to take it offline for a few minutes to replace the drive. It will come back online shortly and the drive will be rebuilt while the server is online.
Update: The server is now back online with a new drive.
Remi.
Web21 Down (fixed)
Posted Apr 17 at 04:00 CDT by Richard in Downtime - 0 comment(s)
Web21 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.
Update: The server has been rebooted and is responding again. We're still investigating the root cause of the issue.
Web27 Down (Fixed)
Posted Apr 16 at 13:39 CDT by Richard in Downtime - 0 comment(s)
Web27 stopped responding and needed to be rebooted. It's back up now after being down for a few minutes. We're investigating the cause of this outage.
Web4 down (fixed)
Posted Apr 7 at 07:43 CDT by Remi in Downtime - 0 comment(s)
One of the drives on Web4 went to read-only. We are rebooting the server now.
Update 1: The server is not responding after the reboot. We're currently investigating the issue.
Update 2: After running fsck everything appears to be working fine.
Web28 Down (fixed)
Posted Mar 12 at 03:46 CDT by Richard in Downtime, Problems - 0 comment(s)
Web28 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.
Update: The server was having hardware issues. We replaced the server with new hardware and moved the drives to the new server. Everything is back to normal now.
Web3 Problems (fixed)
Posted Feb 16 at 12:13 CDT by Richard in Downtime - 0 comment(s)
We have had a couple of outages on web3 in the last 24 hours. We originally had to reboot the server after it stopped responding. That should have been the end of the story, but it seems that a configuration file got corrupted at some point, so when the server came back up again it didn't have enough of various resources. This led to a range of seemingly random problems which we've been chasing down and fixing ever since the original reboot.
The good news is that everything seems to be fixed now. That includes the broken config file, so if the machine ever needs to be rebooted again it should be back up and running in a couple of minutes. We're very sorry for any inconvenience this may have caused.
Kernel upgrade on CentOS5 servers (done)
Posted Feb 13 at 08:59 CDT by Remi in Downtime - 0 comment(s)
CentOS5 have released a kernel update for the local root exploit announced on Sunday night. We're going to remove our patch and upgrade the kernel on our CentOS5 servers.
The following servers will be upgraded and rebooted shortly: dweb18 to dweb20, web21 to web29 and mail5.
We will update this post when the work is complete.
Update: The upgrade is now finished.
Taipan down (fixed)
Posted Jan 13 at 12:58 CDT by Richard in Downtime - 0 comment(s)
Taipan is currently down. It has been having connectivity problems over the last few hours which have been getting progressively worse. We're investigating the problem and will update this ticket with our progress.
Update: Taipan is now back up.
Web6 drive replacement (fixed)
Updated Dec 31 at 03:46 CDT (first posted Dec 30 at 06:51 CDT) by Richard in Downtime, Scheduled downtime - 0 comment(s)
As we mentioned yesterday the primary drive in Web6 needs replacing. It is going down now for this work and will be unavailable until it is completed. We'll update this ticket with our progress.
Update: Web6 is now back up. FYI, Web6 is one of a few old servers without RAID, so it is vulnerable to a single drive failure. All of our other servers have RAID so a single drive failure doesn't affect them.
Web6 down (fixed)
Updated Dec 29 at 12:49 CDT (first posted Dec 28 at 15:42 CDT) by Remi in Downtime - 0 comment(s)
Web6 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.
Update: Web6 is now back up. A misconfiguration in the firewall caused it to become unavailable.
Update 2: We're going to replace the primary drive in Web6 tomorrow. We'll post a new message before we start the work.
Web6 down (fixed)
Posted Dec 27 at 01:50 CDT by Remi in Downtime - 0 comment(s)
The filesystem on Web6 went to "read-only" mode. We're rebooting Web6 and when it comes back up we'll check the filesystem. We'll update this post when things are back to normal.
Update: Web6 is now back up. Total downtime was 20 minutes. The disk check didn't find anything abnormal. We'll keep a close eye on it though.
Web24 down (fixed)
Posted Dec 14 at 12:01 CDT by Remi in Downtime - 0 comment(s)
Web24 is down at the moment and will have to be manually rebooted. Once it's back up we'll update this ticket and will investigate what caused the crash.
Update: Web24 has been back for a while now. Total downtime was 15 minutes (since we had to manually reboot it). Our auditing tools didn't find anything unusual before the crash. We'll keep a close eye on the server.
Servers down after running up2date (fixed)
Posted Nov 20 at 04:01 CDT by Remi in Downtime - 0 comment(s)
For an unknown reason, a dozen of our servers went down after we applied the latest up2date patches. We're currently working on getting these servers back up ASAP and will update this post as soon as the problem is fixed.
The update went well on all the other servers and our test servers.
Update 1: Several of the servers are back up now. We're working our way through the rest.
Update 2: All servers apart from Krait are back to normal. Downtime was between 30 minutes and 2 hours depending on the server. The problem was that sshd got misconfigured after up2date and it didn't come back after a reboot. In the future we will apply up2date patches to servers gradually to avoid these problems. Krait is taking longer to come back up because it runs an older version of sshd and it's taking longer to fix it. We will update this post when Krait is back to normal.
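To give a rough idea of what applying patches gradually could look like, here is a minimal sketch. It is not our actual tooling: `staged_rollout`, `apply_patch` and `is_healthy` are hypothetical names, and the two callables stand in for whatever really runs the update and checks that a server (sshd in particular) still responds afterwards.

```python
def batches(servers, size):
    """Yield consecutive slices of the server list, `size` servers at a time."""
    for start in range(0, len(servers), size):
        yield servers[start:start + size]


def staged_rollout(servers, apply_patch, is_healthy, batch_size=2):
    """Patch servers one batch at a time, stopping at the first bad batch.

    apply_patch(server) and is_healthy(server) are hypothetical callables
    supplied by the caller. Returns (patched, failed): every server that
    was patched, and the servers in the last batch that failed the check.
    """
    patched = []
    for batch in batches(servers, batch_size):
        for server in batch:
            apply_patch(server)
        patched.extend(batch)
        failed = [server for server in batch if not is_healthy(server)]
        if failed:
            # Stop here so the remaining servers are left untouched.
            return patched, failed
    return patched, []
```

The point of the design is that a bad patch only ever reaches one batch: if a server in the first batch fails its check, none of the later servers are touched.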
Update 3: Krait is now back to normal.
Web4 Down (fixed)
Updated May 9 at 02:18 CDT (first posted Mar 23 at 06:36 CDT) by Richard in Downtime - 0 comment(s)
Web4 is down again and we're waiting for it to reboot.
We'll update this post with more info as soon as we know more.
Update: We've had to run FSCK again and are waiting for it to complete. Once it's back up we will look into migrating all accounts off this disk which clearly can't be trusted.
Update: The machine is back up now.
FSCK on Web4 (done)
Updated Jun 11 at 05:47 CDT (first posted Mar 22 at 23:20 CDT) by Remi in Downtime - 0 comment(s)
We had to reboot Web4 and we're currently running FSCK on it. It should be back up shortly. We'll update this post when it's ready.
Update: The server is now back online.
Remi.
Problems on Web4 (fixed)
Updated Jun 11 at 05:49 CDT (first posted Jan 25 at 08:39 CDT) by Remi in Downtime - 0 comment(s)
Web4 just became unresponsive and we're currently rebooting it. When it's back up we'll investigate the problem and will update this entry.
Update 1: Looks like one of the services is keeping the server from booting properly ... Booting it in failsafe mode ...
Update 2: The server is back online now. We'll investigate what caused the problem and will update this post again.
Update 3: We've checked the RAM and the hard drives on the server and haven't noticed any flaw. Our rootkit and intrusion detection tools haven't found anything either so everything looks OK. We'll keep a close eye on it though.
Remi.