The week before last I had a hardware failure following a power outage, and it turned into a major saga. My TV Digibox had effectively died, and I was going to describe the pains I went through to get back up and running with my 100 or so hours of programmes restored. But no sooner was I poised with pen (keyboard) and paper (admin web page) than another issue arose, far more important than blogging or rescuing footage of Woodstock and Coronation Street (ok, the Coro rescue was a very high priority, I grant you).
My main server (debian squeeze) developed a problem wherein network packets were being dropped (seemingly randomly) for up to 30 seconds at a time. This might not seem like a big deal but my ssh connections were dropped, all my network mapped drives were disconnected and none of my backups could complete.
So where to start looking? Well, let’s take the obvious:
- network card (no, got two of them in the machine and both exhibit the same problem)
- network cable (no, got scores of those and they are all good)
- network switch (no, got two of them and changing over from one to the other made no odds)
- other machines interfering (powered them all off to no avail)
- complete power off/on of all machines and switches
- disconnect / reconnect the ADSL
Now let’s think about the less obvious and more difficult to diagnose – perhaps the motherboard?
Dell were very helpful in that they had a spare motherboard (the server is quite old but otherwise sturdy), though only on an exchange basis and, at £185, not exactly a bargain. Their diagnostics, however, could not determine whether there was a fault that matched the symptoms.
I added another linux box to the network and it behaved perfectly well, everything steady as a rock – mapped drives, ssh connections etc. So my conclusion at that juncture was that it was either a motherboard fault or a debian update causing the issue. I had originally installed ‘squeeze’ from a release candidate download but it has had regular updates and has been performing flawlessly.
Next up then: replace the main disk and install ubuntu from a virgin 10.04 LTS CD. I had some difficulties with the networking, which should have been a clue, but finally got it all resolved. My joy was short-lived, as the problem materialised again within half an hour.
Having replaced the original drive again and left the USB drives disconnected to simplify the setup, I also took one of the switches out of the equation and left all but one (windows) box on. Again it seemed to be fine for about 30 minutes.
I left mtr (My Trace Route) running between the debian box and the windows box. I was able to use the web on both boxes and copy files between them, but only for a few minutes before the packet loss occurred once again – even just between the two machines on the LAN.
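For anyone chasing a similar fault, here is a minimal sketch of how bursts like these could be picked out of a log of probes. It assumes one probe per second, recorded as (timestamp, replied) pairs. The function name `find_outages`, the `min_gap` threshold and the sample data are all illustrative assumptions, not mtr output and not anything taken from the setup described here.

```python
from typing import List, Tuple


def find_outages(
    probes: List[Tuple[float, bool]], min_gap: float = 5.0
) -> List[Tuple[float, float, int]]:
    """Return (start, end, lost_count) for each run of consecutive lost
    probes lasting at least min_gap seconds.

    probes must be in time order: (timestamp_in_seconds, replied).
    """
    outages = []
    run_start = None   # timestamp of the first lost probe in the current run
    last_lost = None   # timestamp of the most recent lost probe
    lost = 0           # number of lost probes in the current run

    for ts, replied in probes:
        if not replied:
            if run_start is None:
                run_start = ts
                lost = 0
            last_lost = ts
            lost += 1
        elif run_start is not None:
            # First reply after a run of losses: the outage ends here.
            if ts - run_start >= min_gap:
                outages.append((run_start, ts, lost))
            run_start = None

    # The log may end while probes are still being lost.
    if run_start is not None and last_lost - run_start >= min_gap:
        outages.append((run_start, last_lost, lost))
    return outages


# Made-up sample: one probe per second for a minute, with a 30-second
# blackout starting at t=10 and a single stray drop at t=50.
sample = [(t, not (10 <= t < 40 or t == 50)) for t in range(60)]
print(find_outages(sample))  # [(10, 40, 30)]
```

The threshold filters out the odd isolated drop, so only sustained blackouts of the kind that kill ssh sessions and mapped drives are reported.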
I’ve left out some of the more esoteric things I tried for fear of embarrassment, but this morning, still in a state of despair, the only thing that stopped me buying a replacement server was that it is Easter Sunday.
What are the odds that both onboard and additional network cards might be dodgy? The hardware is all in its sixth year and runs 24/7 – statistically I don’t know what the odds are but it was worth a shot, so I borrowed a card from another machine. Ok so I might have had more chance of winning a lottery.
I was gazing out at the beautiful sunshine and thinking of writing off yet another day’s work when a thought struck me – only an outside chance, but I had disconnected the ADSL rather than rebooting it – the only piece of kit I hadn’t completely restarted.
You already know the end of this saga …