Emergency maintenance on Web65, September 9th, 2013.

Posted in Downtime by

The Web65 filesystem went read-only. We’re currently taking the server down to run an fsck. We will update this post as maintenance progresses.

2013-09-09 16:30 UTC: The initial fsck pass has finished, and fsck is now scanning for multiply-claimed blocks.

2013-09-09 20:37 UTC: The multiply-claimed blocks check has now reached the point where the file system may be too unstable to boot. We’re stopping the file system check and preparing a new server to which we can migrate the data. We’ll post more information when it is available.

2013-09-09 22:49 UTC: The new machine is set up and online. We are now transferring the valid data from the old server and our backups to the new machine. The new server’s IP address is 75.126.173.131.

2013-09-10 00:57 UTC: We are still restoring data to the new server.

2013-09-10 02:48 UTC: We are still restoring data to the new server.

2013-09-10 05:34 UTC: We have transferred approximately 155G of data to the new server. The transfer of the MySQL and PostgreSQL databases has finished. We will now begin restoring the databases.

2013-09-10 08:01 UTC: We have now transferred approximately 221G of data to the new server. All MySQL and PostgreSQL databases have been reloaded. Login has been enabled for users.

2013-09-10 17:19 UTC: We have finished restoring the server from our backup servers and are continuing to transfer files from the old server. Due to the extent of the file system damage, some files transferred from the old server may be damaged or missing, so we have opted to use the data from our backup servers.

2013-09-11 04:40 UTC: The data from our backup systems is still being transferred. If you notice any files or databases still missing, please open a support ticket so we can look into it further.

-
-

[Fixed] Intermittent network routing issues affecting several US servers

Posted in Problems by

Several of our US servers are currently experiencing intermittent network routing issues. Affected servers may include: Dweb100 Dweb101 Dweb102 Dweb104 Dweb105 Dweb110 Dweb111 Dweb112 Dweb113 Dweb114 Dweb115 Dweb116 Dweb117 Dweb118 Dweb119 Dweb120 Dweb121 Dweb122 Dweb123 Dweb124 Dweb125 Dweb126 Dweb127 Dweb128 Dweb129 Dweb130 Dweb133 Dweb134 Dweb135 Dweb137 Dweb140 Dweb141 Dweb142 Dweb143 Dweb144 Dweb145 Dweb146 Dweb147 Dweb149 Dweb150 Dweb151 Dweb152 Dweb153 Dweb154 Dweb158 Dweb160 Dweb161 Dweb162 Dweb163 Dweb164 Dweb91 Dweb92 Dweb93 Dweb94 Dweb95 Dweb96 Dweb97 Mailbox8 Web102 Web105 Web106 Web108 Web11 Web114 Web117 Web119 Web12 Web122 Web126 Web129 Web143 Web148 Web15 Web151 Web155 Web162 Web174 Web175 Web178 Web180 Web182 Web183 Web186 Web187 Web198 Web199 Web200 Web213 Web219 Web220 Web226 Web227 Web228 Web229 Web230 Web231 Web232 Web233 Web234 Web235 Web236 Web237 Web238 Web239 Web24 Web240 Web241 Web243 Web244 Web245 Web246 Web247 Web25 Web27 Web28 Web30 Web300 Web301 Web302 Web307 Web308 Web309 Web31 Web310 Web311 Web312 Web313 Web318 Web319 Web320 Web324 Web328 Web329 Web330 Web335 Web336 Web337 Web338 Web34 Web341 Web342 Web343 Web344 Web345 Web346 Web347 Web348 Web349 Web35 Web37 Web39 Web4 Web40 Web42 Web48 Web49 Web5 Web55 Web57 Web65 Web69 Web70 Web72 Web74 Web75 Web80 Web83 Web91 Web95 Web99

We’re working to resolve this issue and will update this post when we have more information.

2012-10-15 06:07 UTC: The problem was caused by an upstream network carrier and has been resolved.

-
-

[Done] Emergency maintenance on Web65 Thursday, 14 June

Posted in Downtime by

Web65 has been taken offline for emergency maintenance. The server was experiencing extremely high load and was not using any swap space even though the OOM killer was running. We’re now investigating possible hardware problems.

2012-06-13 02:34 UTC: The server is back online, but our testing and investigation aren’t finished, so the server may go offline multiple times before the maintenance is complete.

2012-06-13 03:06 UTC: We’ve taken the server offline again to replace all of its RAM and rule out faulty memory as the cause.

2012-06-13 03:43 UTC: The server is back online now, and we’re monitoring it closely to verify that replacing the RAM has fixed the problems we were seeing.

2012-06-13 04:00 UTC: We’ll continue to monitor the server closely throughout the night, but it looks like the hardware problems we found have been fixed by replacing the RAM.

2012-06-13 04:32 UTC: The problems seem to have returned and we have had to reboot the server. We are still monitoring the server to find out the exact cause.

2012-06-14 12:49 UTC: The problem was with one of our backup subsystems. We have now corrected it, and the server is stable.

-
-

[Fixed] Network outage affecting several servers

Posted in Downtime by

A network outage is currently affecting the following servers: web58, web60, web61, web62, web63, web64, web65, web66, web67, web68, web69, web70, web71, web72, web73, web74, web75, web76, web77, web78, web79, web80, web81, web82, web83, web134, web135, web136.

Our webmail and support ticket systems are also affected.

We are currently investigating the problem and hope to have it resolved soon. We will update this post with more information as the situation progresses.

18:30 UTC: There was a failure in the switch that services the rack housing the affected servers. The problem has been corrected and all of the affected servers are back online.

-
-