17/09/2005 @16:07:43 ^17:31:27

CAT HAS TROPHY

Whenever I see the word "catastrophe" I think of an old Korky the Cat story from The Dandy where a newspaper editor wants his reporter to find "a disaster or a catastrophe" and the reporter finds Korky winning a snooker competition, thus creating a pun so horrible it could engulf the entire world!

Please don't let it be the hard discs!!

Okay so if you're reading this it means my site's back up, which is a relief. I know it says at dog that I will put a note there in case of loss here but a) often I can't update there because I've lost connectivity here and b) I can never be bothered anyway.

Here's what happened: caco (the server) suffered some form of hardware failure some time in the early hours of Wednesday (14th). I didn't notice right away, but around 3pm on Wednesday I noticed a load of slowdown and weird noises. I looked in the logs and found them full of DMA errors and blah and I thought "oh hell, not the hard discs".

I immediately tried to copy stuff off caco onto spare disc space in baron (the desktop). Unfortunately that only got so far before baron froze up due to NFS and caco became unresponsive too, doing nothing but spewing "interrupt lost" messages over and over again. I think at this point I decided there was nothing else to do other than risk horrible data loss and press the reset buttons.

However, it seemed that both discs had refused to work in DMA mode but returned perfectly good results from smartctl. Also, I had found that caco's rarely used CDROM drive wasn't working either. Two hard discs and a CDROM all failing at the same time seemed quite unlikely to me; I figured the fault was more likely to be in the thing they're all plugged into, that is, caco's motherboard.
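To put rough numbers on that hunch, here is a minimal sketch in Python. The probabilities are made up purely for illustration (nothing here was measured), and the function names are hypothetical. It compares the chance of three devices failing independently at once with the chance of one shared component failing.

```python
# Illustrative only: every probability below is an assumption, not a measurement.
P_DISC_FAILS = 0.01    # assumed chance that one hard disc fails in a given week
P_CDROM_FAILS = 0.01   # assumed chance that the CDROM drive fails in that week
P_BOARD_FAILS = 0.005  # assumed chance that the motherboard fails in that week


def independent_failure(p_disc: float, p_cdrom: float) -> float:
    """Chance that two discs and a CDROM all fail independently in the same week."""
    return p_disc * p_disc * p_cdrom


def shared_fault_ratio(p_disc: float, p_cdrom: float, p_board: float) -> float:
    """How many times more likely a single motherboard fault is than three
    independent device failures, under the assumed probabilities."""
    return p_board / independent_failure(p_disc, p_cdrom)


if __name__ == "__main__":
    print("three independent failures:", independent_failure(P_DISC_FAILS, P_CDROM_FAILS))
    print("single motherboard failure:", P_BOARD_FAILS)
    print("ratio:", shared_fault_ratio(P_DISC_FAILS, P_CDROM_FAILS, P_BOARD_FAILS))
```

With these invented figures the shared fault comes out thousands of times more likely than the coincidence. The exact ratio means nothing, but the shape of the argument holds for any remotely plausible numbers.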

I put caco's discs into baron and booted baron from a live CD I still had from a few months ago, and aside from some fiddling about to get LVM to work, the drives seemed to be fine. That was somewhat of a relief. However, when I tried caco again it totally refused to even power up: I'd turn on the PSU and press the power button, and the CPU fan would spin for a fraction of a second and that was it. I removed everything except the CPU from caco's motherboard and tried again, with the same result. I gave up and went to sleep.

I was quite miserable and it took half of Thursday (15th) to be bothered to work out what to do next. baron relies rather heavily on caco to be a firewall and file server but fortunately still has a DHCP client on it, so having disabled a number of services I plugged the modem in directly and managed to get it online. However, having no access to my home directory - it's a good thing I've yet to get round to centralising the passwd database, else I wouldn't have been able to log in at all! - left it looking bare and spartan and frankly alien. Having gone to eBay and found a replacement motherboard (somehow managing to avoid my usual need to think about any purchase for at least a month before making it), I gave up again and switched it off.

Not even sure that buying a replacement motherboard was the right thing to do, I wasted the rest of Thursday and the entirety of Friday watching some of the collection of stupid films I've carefully recorded off the TV over the years.

Saturday (17th) began with a pleasant surprise - the package arrived several days before I thought it would; indeed, I didn't know they did deliveries on Saturday. The woman driving the post office van asked me how long I'd been growing my hair, which was nice. I took the new motherboard upstairs and eventually managed to fit it. This wasn't quite a trivial task. I really hate CPU heatsinks. This one wouldn't come off. I could only lever it off by removing the fan from the top first to get it out of the way. Oh well. Fortunately I had some thermal grease, though I probably made an awful mess of my first attempt to apply it. I took the opportunity to switch out the 600MHz processor for baron's old 800MHz one, something I'd been meaning to do for ages, and I also put the metal back plate back into the case. (I'd pulled it out in 2003 because it blocked access to the motherboard's on-board ethernet, but I don't need that now since it has two network cards anyway. Some of the dust inside the case looked a lot like cobwebs, and how stupid is it to lose a motherboard because a spider has laid its eggs on it!)

Well obviously this has a happy ending because here we are, but just to spell it out, I put everything back together, turned it on, and it booted apparently fine. Having edited a few configuration files to deal with the inevitable reshuffling of the numbering of ethernet devices that Linux always does when you mess with them, it's all working fine. For now. I hope.

Casualty list: One drum and bass show I didn't bother to record, a few files in /tmp I didn't care about, my uptime, and my external IP address (I don't like the new one!)

I guess I need to source a few redundant replacements and keep a parts bin of bits of hardware to swap in in case anything fails suddenly again in the future. But like I keep saying, I'm just glad I didn't lose any data - I have backups now but if both discs in the server fail simultaneously I'm still just as screwed...