Archives for the 'Downtime' category

Web 35 Down (fixed)

Updated Jun 1 at 12:09 CDT (first posted May 29 at 18:16 CDT) by David Sissitka in Downtime, Problems  - 0 comment(s)

Web 35's root partition has gone read only. We are looking into it now.

[06:16 PM CST] Update: Web 35 is up and running again. There do not appear to be any software problems so we are running a diagnostic test on the server's hardware.

[08:58 PM CST] Update: We are currently running fsck on Web 35.

[10:48 PM CST] Update: The fsck is complete and the server is back online.
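
Several entries in this archive involve a filesystem going read-only. As a rough illustration only (this is not WebFaction's monitoring, and the function name `read_only_mounts` and the sample data are hypothetical), the Python sketch below scans text in the format of Linux's /proc/mounts and lists the mount points whose options include `ro`:

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample data, not taken from Web 35.
SAMPLE = "/dev/sda1 / ext3 ro,relatime 0 0\nproc /proc proc rw 0 0\n"
print(read_only_mounts(SAMPLE))  # ['/']
```

On a real Linux machine the same function could be fed the contents of /proc/mounts. Some pseudo-filesystems are mounted read-only on purpose, so a non-empty result is a prompt to investigate rather than proof of a fault.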

Read the full article and comments


Web73 down (fixed)

Posted Apr 27 at 06:16 CDT by David in Downtime, Problems  - 0 comment(s)

Web73 is currently down while we investigate some filesystem errors. We'll update the post as soon as we have more information.

Update (12.40pm GMT): The filesystem on the server is corrupted beyond recovery so we're going to do an OS reload and restore the data from backup. We'll update this post with our progress.

Update (3.30pm GMT): We have now moved the server onto new hardware (in case the filesystem errors were hardware-related) and we have started copying all the data from backup.

Update (5.30pm GMT): The server is now back up with new hardware and the data from yesterday's backup. Note that the RSA host key has changed so your SSH client may display a warning about it.

Read the full article and comments


Drive replacement on web42 (fixed)

Posted Apr 24 at 11:32 CDT by Remi in Downtime  - 0 comment(s)

One of the drives on web42 died and we are currently rebuilding the RAID with the new drive. We will update this post once the server is back online.

2009-04-24 12:26 CDT - the drive rebuild is complete and Web42 is back online.

Read the full article and comments


Mail5 and Webmail problems (fixed)

Posted Apr 17 at 09:24 CDT by Sean in Downtime  - 0 comment(s)

Mail services on mail5.webfaction.com and webmail.webfaction.com are currently not working. We are looking into the problem and hope to have normal service restored soon. We will update this entry as we have more information.

2009-04-17 10:24 CDT - troubleshooting on mail5/webmail is still in progress.

2009-04-17 11:33 CDT - we've just pointed webmail.webfaction.com at a different server IP. Webmail users will be able to access webmail as soon as the DNS change propagates, but you will not have access to your usual webmail address book since it is located on mail5. If your mailbox resides on mail5, you still will not be able to access your mail. Troubleshooting on mail5 is still in progress.

2009-04-17 15:37 CDT - the problem on mail5 appears to be a failed OS upgrade. We are re-installing packages now.

2009-04-17 17:29 CDT - mail5 is back online and webmail.webfaction.com has been pointed back to mail5. All mail5 users should be able to access their mail now, but the server may be slow to respond for the next several hours as it catches up with today's incoming mail.

Read the full article and comments


Web67 Down (fixed)

Posted Apr 14 at 11:40 CDT by Sean in Downtime  - 0 comment(s)

Web67 is currently down while we investigate a potential problem on the filesystem. We will update this entry as we have more information.

2009-04-14 12:13 CDT - A filesystem repair is in progress on Web67. We hope to have service restored soon.

2009-04-14 12:42 CDT - The filesystem repair on Web67 is still in progress.

2009-04-14 12:44 CDT - The filesystem repair on Web67 completed successfully and the server is now online.

Read the full article and comments


Web42 audible alarm (fixed)

Posted Mar 27 at 04:10 CDT by Remi in Downtime  - 0 comment(s)

Web42 is currently down while we investigate an audible alarm. We'll update this post once we know more about the issue.

Update: The problem was a degraded RAID on the server. The server is now back online and the RAID is rebuilding in the background.

Read the full article and comments


Web59 Down (fixed)

Posted Mar 23 at 14:34 CDT by David S in Downtime  - 0 comment(s)

Web59 is currently inaccessible and we're looking into it.

Update: The problem was a misconfiguration in the firewall and it is now fixed.

Read the full article and comments


Web 69 Down (fixed)

Updated Mar 10 at 08:19 CDT (first posted Mar 9 at 20:46 CDT) by David Sissitka in Downtime, Problems  - 0 comment(s)

Web 69 is currently down. Its root partition went read only and rebooting it revealed an issue which we are working to resolve. We will post updates as they become available.

It appears to be an issue with the RAID controller and we are currently replacing the hardware and restoring all data from backup.

2009-03-09 06:00 PST: Web69 is still down having suffered a serious RAID controller failure. We have recovered all of the data from the server and are currently working on restoring it to a new standby server which will replace web69.

2009-03-09 06:16 PST: Web69 is now back online with all its data. We decided to move the data onto a new server to give us more time to check the hardware on the failing server. We copied all the data from just before the crash so no data has been lost.

Read the full article and comments


Web37 Down (Fixed)

Posted Feb 17 at 09:07 CDT by Sean in Downtime  - 0 comment(s)

Web37 is currently down. We are investigating the problem at this time and will update this entry as we have more info.

2009-02-17 9:25 CST - Web37 is back online.

Read the full article and comments


Web64 Down (Fixed)

Posted Feb 13 at 16:05 CDT by Sean in Downtime, Problems  - 0 comment(s)

The filesystem on Web64 went read-only several minutes ago. We're currently rebooting the machine and will have normal service restored ASAP.

2009-02-13 16:33 CST - Web64 is back online. We may need to bring it down again in the near future for filesystem maintenance, but we'll give advance notice if we do.

2009-02-13 18:02 CST - The filesystem on Web64 just went read-only again, so we're going to go ahead and perform filesystem maintenance now. We'll get the machine back online ASAP.

2009-02-13 18:43 CST - The filesystem check on Web64 is still in progress.

2009-02-13 19:13 - The filesystem check is complete and Web64 is back online.

Read the full article and comments


Web4 Down (fixed)

Posted Feb 3 at 04:24 CDT by Sime in Downtime  - 0 comment(s)

Web4 is down with a file system issue. We'll update this post as soon as it is back up.

2009-02-03 10:46 CST - The filesystem check on Web4 is still in progress. We hope to have service on Web4 restored soon.

2009-02-03 11:18 CST - Web4 is back online.

Read the full article and comments


Web 39 DDoS attack (fixed)

Updated Jan 5 at 08:42 CDT (first posted Dec 23 at 17:57 CDT) by David Sissitka in Downtime, Problems  - 0 comment(s)

At the moment Web 39 is being DDoS attacked. We're currently working with The Planet to fend off the attack. We'll keep you updated.

Update [06:15 PM]: The Planet has implemented flood protection at the network level and everything now appears to be working as expected again.

Read the full article and comments


[Done] Rebooting Web 24

Posted Nov 10 at 15:54 CDT by David Sissitka in Downtime  - 0 comment(s)

In a couple of minutes Web 24 will be rebooted. Its load is high because it was booted using the wrong kernel and as a result it only utilizes 3.5 GB out of the installed 4 GB of memory. The down time should last no longer than 10 minutes.

Update [04:35 PM CST]: Before we could fix the problem Web 24 went down so the down time is lasting longer than expected. Everything should be back to normal momentarily.
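
For readers curious how a memory shortfall like the one described above might be spotted, here is a minimal Python sketch. It assumes text in the format of Linux's /proc/meminfo; the function names and the sample value are hypothetical and are not taken from Web 24.

```python
def mem_total_gb(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GB (2**30 bytes)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def memory_shortfall_gb(meminfo_text, installed_gb):
    """Return how much of the installed memory the kernel is not using, in GB."""
    return installed_gb - mem_total_gb(meminfo_text)


# Hypothetical sample: a kernel that sees only 3.5 GB.
SAMPLE = "MemTotal:        3670016 kB\nMemFree:          123456 kB\n"
print(memory_shortfall_gb(SAMPLE, installed_gb=4))  # 0.5
```

MemTotal is normally a little below the installed amount because the kernel reserves some memory for itself, so only a large gap would suggest a problem like the wrong kernel being booted.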

Read the full article and comments


Web24 down (fixed)

Posted Nov 4 at 08:33 CDT by Sean in Downtime  - 0 comment(s)

Web24 stopped responding several minutes ago, so we rebooted it. The server is going through a disk check at this moment. We hope to have service restored soon, and we'll update this entry when we have more info.

2008-11-04 8:59 - Web24 is back online. Initial investigation shows that there was a load spike immediately before the server went down.

Read the full article and comments


Mail1 down (fixed)

Posted Oct 28 at 13:16 CDT by Sean in Downtime, Problems  - 0 comment(s)

Mail1 was the target of a massive backscatter spam attack earlier today, and failed due to extremely high load. We've got the attack under control, and Mail1 should be responding normally now.

Read the full article and comments


[Done] Rebooting Web 23

Posted Oct 27 at 18:06 CDT by David Sissitka in Downtime  - 0 comment(s)

In a couple of minutes Web 23 will be rebooted. Its load is high because it was booted using the wrong kernel and as a result it only utilizes 3.5 GB out of the installed 4 GB of memory. The down time should last no longer than 10 minutes.

Read the full article and comments


Web21 Down (fixed)

Updated Oct 28 at 13:19 CDT (first posted Oct 27 at 13:25 CDT) by Sean in Downtime, Problems  - 0 comment(s)

Web21 stopped responding a few minutes ago, so we're rebooting the machine now. We'll update this post when the server is back online.

2008-10-27 13:37 - Web21 is back online.

Read the full article and comments


Web 24 Inaccessible (Fixed)

Posted Oct 20 at 10:44 CDT by David Sissitka in Downtime  - 0 comment(s)

Web 24 is currently inaccessible. We're waiting for someone at the data center to reboot it now. We'll post updates as we receive them.

Update (10:46 PM CST): Web 24 is up and running again.

Read the full article and comments


Mail services down on mail3 (Fixed)

Posted Oct 11 at 09:58 CDT by Sean in Downtime, Problems  - 0 comment(s)

We've temporarily stopped mail services on mail3.webfaction.com while we deal with a massive backscatter spam attack. We'll update this entry when service is restored.

2008-10-11 10:19 CST: Service on mail3 has been restored.

Read the full article and comments


Web 12 Inaccessible (Fixed)

Updated Oct 11 at 10:02 CDT (first posted Oct 9 at 16:09 CDT) by David Sissitka in Downtime  - 0 comment(s)

Web 12 is currently inaccessible. We're waiting for someone at the data center to reboot it now. We'll post updates as we receive them.

Update (04:29 PM CST): Web 12 is up and running again.

Read the full article and comments


Web42 down (fixed)

Posted Oct 2 at 05:30 CDT by Richard in Downtime, Problems  - 0 comment(s)

Web42 is currently down. One of the disks in the RAID array died and needed to be replaced. The server is down while we carry out this work. We'll update this post as soon as the server is back online.

Update: (05:55 CDT) The failed disk has been replaced and the server is back online.

Read the full article and comments


[Resolved] Web 24 Inaccessible

Posted Sep 29 at 19:50 CDT by David Sissitka in Downtime  - 0 comment(s)

Web 24 is currently inaccessible. We've requested that someone at the data center take a look. We'll keep you updated.

Update (07:58 PM CDT): Web 24 is up and running again. We're currently trying to determine why it went down.

Read the full article and comments


Apache on Web28 is down (fixed)

Updated Sep 27 at 08:44 CDT (first posted Sep 26 at 12:35 CDT) by Sean in Downtime, Problems  - 0 comment(s)

The main Apache instance on Web28 is currently down while we troubleshoot a configuration problem. We'll update this entry when service is restored.

2008-09-26 12:48 CST - Apache on Web28 is back online.

2008-09-26 14:19 CST - We're taking Apache back down for a few more tweaks. We'll update this entry when service is restored.

2008-09-26 19:31 CST - Work on Web28's Apache instance is complete.

Read the full article and comments


Web21 down (fixed)

Posted Sep 24 at 11:45 CDT by Sean in Downtime, Problems  - 0 comment(s)

Web21 is currently down. We are working with the data center to restore service. We'll update this post as soon as the server is back online.

2008-09-24 13:21 CST: Web21 is back online. The problem was a failure in the server's network hardware, which was replaced by the data center.

Read the full article and comments


Web3 Down (fixed)

Posted Sep 20 at 00:10 CDT by David Sissitka in Downtime  - 0 comment(s)

Web3 is down at the moment. It might take a bit longer than usual for the server to come back online because the datacenter is doing some maintenance work at the moment. We'll update this post as soon as the server is back online.

Update: Web3 is now back to normal

Read the full article and comments


Web28 down (fixed)

Posted Sep 16 at 09:39 CDT by Sean in Downtime, Problems  - 0 comment(s)

Web28 stopped responding about an hour ago. The data center has rebooted the server and it is now back online.

Read the full article and comments


[Resolved] Networking Problem

Updated Sep 2 at 10:20 CDT (first posted Sep 1 at 13:55 CDT) by David Sissitka in Downtime  - 0 comment(s)

The Planet appears to be having a networking problem that's making some servers inaccessible to some people. We've contacted The Planet re the problem and we'll post updates as we receive them.

Update (02:07 PM CDT): The Planet is currently working on their DNS servers to resolve an issue with intermittent nameserver resolution. They didn't have an ETA as to when this would be done for us.

Read the full article and comments


Web4 rebooting (Fixed)

Posted Aug 6 at 08:20 CDT by Sean in Downtime, Problems  - 0 comment(s)

The filesystem on Web4 went read-only a few moments ago. We are rebooting that server now and hope to have normal service restored soon.

Update 2008-08-06 08:33: Web4 is back online.

Read the full article and comments


Help system and Mail3 down (fixed)

Posted Aug 4 at 11:56 CDT by Sean in Downtime, Problems  - 0 comment(s)

The help system and mail3 are currently down. We're currently rebooting that server and hope to have normal service restored soon.

Update: 2008-08-04 12:01 - Mail3 is back online.

Update: 2008-08-04 15:22 - IMAP and SSH stopped responding on mail3, so we had to reboot it once more.

Read the full article and comments


Web35 Down (Fixed)

Posted Aug 2 at 20:52 CDT by Richard in Downtime, Problems  - 0 comment(s)

The filesystem on Web35 started having problems a few minutes ago. We are rebooting Web35 now, and will perform a filesystem check before bringing the server back online. We'll update this entry when we have new information.

Update 2008-08-02 21:01: The server is back up now.

Read the full article and comments


Web13 down (fixed)

Posted Jul 28 at 04:44 CDT by Remi in Downtime  - 0 comment(s)

Web13 is currently down for an unknown reason. We'll update this post once we have more information.

Update: Web13 is now back online. The problem was caused by a misconfiguration of the network interface on the machine.

Read the full article and comments


Network issues at ThePlanet affecting multiple servers (fixed)

Updated Jun 11 at 05:48 CDT (first posted Jun 6 at 10:00 CDT) by Sean in Downtime  - 0 comment(s)

An apparent outage at ThePlanet's H1 data center is currently affecting multiple WebFaction servers. We will update this entry when we have more information.

From ThePlanet: June 6 10:00am CDT We have lost network connectivity to H1. We are confirming the extent of any power loss, and we will be updating shortly.

From ThePlanet: June 6 10:05am CDT Transport for H1 temporarily fell offline and is restored. H1 Phase 2 did not lose power. H1 Phase 1 lost power. We will be updating again shortly.

From ThePlanet: June 6 10:10am CDT The temporary generator powering Phase 1 failed. We switched over to the backup generators that were just brought in. The CRAC units have been powered on, and PDUs are having power restored right now.

From ThePlanet: June 6 10:15am CDT We continue to power PDUs in Phase 1. Customer servers should be coming back online shortly.

Normal service has been restored.

Read the full article and comments


Web35 rebooting (fixed)

Updated Jun 11 at 05:50 CDT (first posted Jun 5 at 15:41 CDT) by Sean in Downtime  - 0 comment(s)

The main filesystem on Web35 went read-only a few minutes ago. We are rebooting Web35 now, and will perform a filesystem check before bringing the server back online. We'll update this entry as the situation develops.

Update @ 18:05 CST: Filesystem check on Web35 is complete, and Web35 is back online.

Read the full article and comments


Datacentre issue updates for June 4 (fixed)

Updated Jun 11 at 05:47 CDT (first posted Jun 4 at 09:00 CDT) by Sean in Downtime  - 0 comment(s)

Update (June 4, 09:00 CDT): Recovery work on Web27 is ongoing - we hope to have it completed within the next 12 hours. Web35 is currently down for a reboot and disk diagnostics. We hope to have service to Web35 restored soon.

Update (June 4, 10:44 CDT): Recovery work on Web27 is complete, and most customer sites on Web27 are back online. The server IP address for Web27 has changed - the new address is 70.84.101.162. Customers using third-party DNS servers (i.e., not ns*.webfaction.com) will need to update their DNS info to point to the new IP.

Web35 is still down - we hope to have service to Web35 restored soon.

Update (June 4, 12:57 CDT): Web35 is now back online.
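
Customers on third-party DNS who want to confirm that a domain now points at Web27's new address could use something like the Python sketch below. It is only an illustration: the function name `points_to` and the injectable `resolve` parameter are assumptions made for this example, and the domain in the usage comment is a placeholder.

```python
import socket

NEW_WEB27_IP = "70.84.101.162"  # new Web27 address from the update above


def points_to(hostname, expected_ip, resolve=socket.gethostbyname):
    """Return True if hostname currently resolves to expected_ip.

    `resolve` defaults to the system resolver and can be swapped out for testing.
    """
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # Covers lookup failures (socket.gaierror is a subclass of OSError).
        return False


# Usage (placeholder domain; replace with your own):
# points_to("www.example.com", NEW_WEB27_IP)
```

Because this asks the local resolver, a cached record can keep returning the old address until its TTL expires, so a False result shortly after a DNS change does not necessarily mean the update was entered incorrectly.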

Read the full article and comments


Datacentre issue update (fixed)

Updated Jun 11 at 05:47 CDT (first posted Jun 2 at 10:03 CDT) by Remi in Downtime  - 0 comment(s)

Here is an update to the previous post: the power has been restored in parts of the datacenter and most servers and services have been restored (including our control panel and support system).

The following servers are in another part of the datacenter which will only have power later (there is no exact ETA but it probably won't be back today):
Krait, Mail2, Web6, Web27, Web28 and Web29

These servers are currently being moved to another datacenter but there is no ETA either on when it will be completed.

In the meantime we can provide free plans on other servers to anyone who's currently on Krait, Web6, Web27, Web28 and Web29. If you would like one of these accounts just open a ticket and let us know.

Also, we have set up a backup mail server to receive e-mails sent to Mail2 and store them until Mail2 is back up.

Update: All servers are now back up except Web27. We're still trying to get Web27 back up but there is no ETA on when this will happen (of course, you can still get a free plan on another server if you're on Web27).

Update: Due to an issue with one of the backup generators the following servers lost their connectivity: Krait, Mail2, Web6, Web27, Web28 and Web29. The datacentre team is working hard to restore connectivity.

Update (June 3, 09:40 CDT): The datacentre is currently testing a new backup generator. Hopefully connectivity will be restored soon.

Update (June 3, 12:09 CDT): Unfortunately using the backup generator didn't work and another backup generator is currently being delivered to the datacentre.

Update (June 3, 15:54 CDT): The backup generator has arrived at the datacentre and is now being filled with fuel and tested.

Update (June 3, 19:15 CDT): The backup generator is now working and all servers except Web27 have come back up. Web27 isn't coming back up and we're currently investigating what the problem is.

Update (June 3, 20:10 CDT): Web27 is going into a kernel panic but the RAID array doesn't report any faulty drive. At this point we're pursuing two options in parallel: trying to fix Web27, and restoring people's data on another server from backups. We'll go with whichever is completed first.


Read the full article and comments


Network outage affecting several servers (fixed)

Updated Jun 11 at 05:48 CDT (first posted May 31 at 18:22 CDT) by Richard in Downtime  - 0 comment(s)

Issues at one of our data centers are causing downtime on several servers. We are working to resolve this problem as soon as possible.

Update: Today at approximately 5:45 p.m. CDT, a transformer in one of The Planet's Houston datacentres caught fire, requiring them to take down all of the generators on site on the instructions of the fire department. This is one of six datacentres used by WebFaction. All servers hosted at that datacentre are currently offline.

Update: No servers in the datacentre have been damaged. However, they are still down because power is still out.

Update: The datacentre staff are still working to restore power to all affected servers

Update: A few minutes ago The Planet posted some more information about the outage. Here are a couple of excerpts from the post:

This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.

All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.

Update: The datacentre staff are still working to fix the various network and power issues caused by the fire. There is currently no firm estimate for when everything will be back online.

Update: The datacentre staff are still working to restore power to the datacentre. Here's what they have to say:

We expect to be able to provide initial power to parts of the H1 data center beginning at 5:00 p.m. CDT. At that time, we will begin testing and validating network and power systems, turning on air-conditioning systems and monitoring environmental conditions. We expect this testing to last approximately four hours.

Following this testing, we will begin to power-on customer servers in phases. These are approximate times, and as we know more, we will keep you apprised of the situation.

Meanwhile, we are still working to get our customers' services up and running in a different data centre. It's currently unclear which is going to be the quicker fix (restoring power to the data centre or moving all our services) which is why we're pursuing both options.

Update (Jun 2, 8:40 am UTC): The second data center floor has been cooled down and restoration of power is in progress. Most of our servers are located on the second floor.

We've got a full staff in the data center to power up racks in sections and verify that the server hardware starts up successfully. This process may take a few hours to restore service to all customer servers on the second floor.

Update (Jun 2, 9:30 am UTC): Here's a list of the affected servers:

  • mail2.webfaction.com
  • krait.webfaction.com
  • web5.webfaction.com
  • web6.webfaction.com
  • web7.webfaction.com
  • web27.webfaction.com
  • web28.webfaction.com
  • web29.webfaction.com
  • web30.webfaction.com
  • web31.webfaction.com
  • web33.webfaction.com
  • web34.webfaction.com
  • web35.webfaction.com
  • web37.webfaction.com
  • web40.webfaction.com
  • web41.webfaction.com
  • dweb1.webfaction.com
  • dweb23.webfaction.com
  • dweb26.webfaction.com
  • dweb27.webfaction.com

Of those all except mail2, krait, web5, web6, web7, web28, web29 and dweb1 are on the second floor of the datacentre.

The servers on the second floor are being powered up in batches currently. They should all be up and running within the next few hours.

The servers on the first floor are unlikely to receive power today. We're working to move all of the services and sites hosted on them to backup servers.

Update (Jun 2, 11:20 am UTC): The following servers are now up and running:

  • web5.webfaction.com
  • web7.webfaction.com
  • web30.webfaction.com
  • web31.webfaction.com
  • web33.webfaction.com
  • web34.webfaction.com
  • web35.webfaction.com
  • web37.webfaction.com
  • dweb1.webfaction.com
  • dweb23.webfaction.com
  • dweb26.webfaction.com
  • dweb27.webfaction.com
Read the full article and comments


Mail2 is down (fixed)

Posted May 22 at 10:50 CDT by Sean Fulmer in Downtime, Problems  - 0 comment(s)

Mail2 is currently down - we have a pending reboot request in at our data center and service should be restored shortly.

Update: mail2 is back online.

Read the full article and comments


Web21 re-booted (fixed)

Posted May 12 at 08:45 CDT by Sean Fulmer in Downtime  - 0 comment(s)

Apache and SSH stopped responding on Web21 this morning, so we had to reboot it.

The server is back online now. We're still investigating the root cause of the issue.

Read the full article and comments


Web3 down (fixed)

Posted Apr 24 at 09:19 CDT by Remi in Downtime  - 0 comment(s)

We're fixing a problem with sshd on Web3 and the server is currently down. It should be back up in a few minutes.

Update: Web3 is now back online

Read the full article and comments


Web27 rebooting (fixed)

Posted Apr 19 at 14:36 CDT by Remi in Downtime  - 0 comment(s)

One of the drives in Web27's RAID died and we had to take it offline for a few minutes to replace the drive. It will come back online shortly and the drive will be rebuilt while the server is online.

Update: The server is now back online with a new drive.

Remi.

Read the full article and comments


Web21 Down (fixed)

Posted Apr 17 at 04:00 CDT by Richard in Downtime  - 0 comment(s)

Web21 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.

Update: The server has been rebooted and is responding again. We're still investigating the root cause of the issue.

Read the full article and comments


Web27 Down (Fixed)

Posted Apr 16 at 13:39 CDT by Richard in Downtime  - 0 comment(s)

Web27 stopped responding and needed to be rebooted. It's back up now after being down for a few minutes. We're investigating the cause of this outage.

Read the full article and comments


Web4 down (fixed)

Posted Apr 7 at 07:43 CDT by Remi in Downtime  - 0 comment(s)

One of the drives on Web4 went to read-only. We are rebooting the server now.

Update 1: The server is not responding after the reboot. We're currently investigating the issue.

Update 2: After running fsck everything appears to be working fine.

Read the full article and comments


Web28 Down (fixed)

Posted Mar 12 at 03:46 CDT by Richard in Downtime, Problems  - 0 comment(s)

Web28 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.

Update: The server was having hardware issues. We replaced the server with new hardware and moved the drives to the new server. Everything is back to normal now.

Read the full article and comments


Web3 Problems (fixed)

Posted Feb 16 at 12:13 CDT by Richard in Downtime  - 0 comment(s)

We have had a couple of outages on web3 in the last 24 hours. We originally had to reboot the server after it stopped responding. That should have been the end of the story but it seems that a configuration file got corrupted at some point so when the server came back up again it didn't have enough of various resources. This led to a range of seemingly random problems which we've been chasing down and fixing ever since the original reboot.

The good news is that everything seems to be fixed now. That includes the broken config file, so if the machine ever needs to be rebooted again it should be back up and running in a couple of minutes. We're very sorry for any inconvenience this may have caused.

Read the full article and comments


Kernel upgrade on CentOS5 servers (done)

Posted Feb 13 at 08:59 CDT by Remi in Downtime  - 0 comment(s)

CentOS5 have released a kernel update for the local root exploit announced on Sunday night. We're going to remove our patch and upgrade the kernel on our CentOS5 servers.

The following servers will be upgraded and rebooted shortly: dweb18 to dweb20, web21 to web29 and mail5.

We will update this post when the work is complete.

Update: The upgrade is now finished.

Read the full article and comments


Taipan down (fixed)

Posted Jan 13 at 12:58 CDT by Richard in Downtime  - 0 comment(s)

Taipan is currently down. It has been having connectivity problems over the last few hours which have been getting progressively worse. We're investigating the problem and will update this ticket with our progress.

Update: Taipan is now back up.

Read the full article and comments


Web6 drive replacement (fixed)

Updated Dec 31 at 03:46 CDT (first posted Dec 30 at 06:51 CDT) by Richard in Downtime, Scheduled downtime  - 0 comment(s)

As we mentioned yesterday the primary drive in Web6 needs replacing. It is going down now for this work and will be unavailable until it is completed. We'll update this ticket with our progress.

Update: Web6 is now back up. FYI, Web6 is one of a few old servers without RAID, so it is vulnerable to a single drive failure. All of our other servers have RAID so a single drive failure doesn't affect them.

Read the full article and comments


Web6 down (fixed)

Updated Dec 29 at 12:49 CDT (first posted Dec 28 at 15:42 CDT) by Remi in Downtime  - 0 comment(s)

Web6 is down right now for an unknown reason. We're investigating the problem and will update this ticket with our progress.

Update: Web6 is now back up. A misconfiguration in the firewall caused it to become unavailable.

Update 2: We're going to replace the primary drive in Web6 tomorrow. We'll post a new message before we start the work.

Read the full article and comments


Web6 down (fixed)

Posted Dec 27 at 01:50 CDT by Remi in Downtime  - 0 comment(s)

The filesystem on Web6 went to "read-only" mode. We're rebooting Web6 and when it comes back up we'll check the filesystem. We'll update this post when things are back to normal.

Update: Web6 is now back up. Total downtime was 20 minutes. The disk check didn't find anything abnormal. We'll keep a close eye on it though.

Read the full article and comments


Web24 down (fixed)

Posted Dec 14 at 12:01 CDT by Remi in Downtime  - 0 comment(s)

Web24 is down at the moment and will have to be manually rebooted. Once it's back up we'll update this ticket and will investigate what caused the crash.

Update: Web24 has been back for a while now. Total downtime was 15 minutes (since we had to manually reboot it). Our auditing tools didn't find anything unusual before the crash. We'll keep a close eye on the server.

Read the full article and comments


Servers down after running up2date (fixed)

Posted Nov 20 at 04:01 CDT by Remi in Downtime  - 0 comment(s)

For an unknown reason, a dozen of our servers went down after we applied the latest up2date patches. We're currently working on getting these servers back up ASAP and will update this post as soon as the problem is fixed.

The update went well on all the other servers and our test servers.

Update 1: Several of the servers are back up now. We're working our way through the rest.

Update 2: All servers apart from Krait are back to normal. Downtime was between 30 minutes and 2 hours depending on the server. The problem was that sshd got misconfigured after up2date and it didn't come back after a reboot. In the future we will apply up2date patches to servers gradually to avoid these problems. Krait is taking longer to come back up because it runs an older version of sshd and it's taking longer to fix it. We will update this post when Krait is back to normal.

Update 3: Krait is now back to normal.

Read the full article and comments


Web4 Down (fixed)

Updated May 9 at 02:18 CDT (first posted Mar 23 at 06:36 CDT) by Richard in Downtime  - 0 comment(s)

Web4 is down again and we're waiting for it to reboot.

We'll update this post with more info as soon as we know more.

Update: We've had to run FSCK again and are waiting for it to complete. Once it's back up we will look into migrating all accounts off this disk, which clearly can't be trusted.

Update: The machine is back up now.

Read the full article and comments


FSCK on Web4 (done)

Updated Jun 11 at 05:47 CDT (first posted Mar 22 at 23:20 CDT) by Remi in Downtime  - 0 comment(s)

We had to reboot Web4 and we're currently running FSCK on it. It should be back up shortly. We'll update this post when it's ready.

Update: The server is now back online.


Remi.

Read the full article and comments


Problems on Web4 (fixed)

Updated Jun 11 at 05:49 CDT (first posted Jan 25 at 08:39 CDT) by Remi in Downtime  - 0 comment(s)

Web4 just became unresponsive and we're currently rebooting it. When it's back up we'll investigate the problem and will update this entry.

Update 1: Looks like one of the services is keeping the server from booting properly ... Booting it in failsafe mode ...

Update 2: The server is back online now. We'll investigate what caused the problem and will update this post again.

Update 3: We've checked the RAM and the hard drives on the server and haven't noticed any flaw. Our rootkit and intrusion detection tools haven't found anything either so everything looks OK. We'll keep a close eye on it though.

Remi.

Read the full article and comments