03 June, 2015

Another (Couple of) Days in the IT Grind

I do indeed realize there are far worse things that could have happened to me, but the past couple of days have not been good.  I am a technologist, and as such, I get very uneasy and tense when the technology I own fails.

It started out during experimentation with installing Xubuntu on a Pentium II (P-II) 450 machine.  Earlier this week I had taken apart a failed ATX-style power supply, unsoldered all its output cables, opened up a Dell power supply (which uses proprietary wiring on its 20P20C plug), desoldered the proprietary harness, and soldered in the standard 20P19C plug.  I don't care if I blow up this P-II 450 system, because it is one of the lowliest of the capable systems I have here at home, and a bit wonky at times besides.  So it was the natural target for testing my power supply unit (PSU) cable transplant job.

It turns out the wiring job went well: no magic smoke or sparks were released from either the PSU or the computer.  As just mentioned, it is a bit of a funky system, and with the transplant PSU it seemed willing to boot off the optical disc but not the hard disk (HDD).  With another supply I have, it didn't want to boot off optical (it got to a certain point in the Xubuntu 12.04 disc and rebooted), but the HDD seemed to operate, albeit with a bootloader installation error I was trying to remedy (hence I needed both drives working).  For whatever oddball reason, a brand-new PSU with less than an hour of power-on time finally ran the computer, HDD, and CD OK.  (Since then, the other PSUs seem to work too; I don't know what changed other than all the unplugging and replugging.)

The first weirdness was that this ancient Intel mainboard complained it could not read the SPD (serial presence detect) data of the RAM I put into it (it had 2 x 128M DIMMs, which I was replacing with 2 x 256M + 1 x 128M).  So I puttered around with Google and Intel's legacy support site, and managed to make up a BIOS update floppy.  After flashing, the SPD error did not go away, and (probably because of that) it will no longer "quick boot" (skip the RAM test), PLUS I haven't found a keystroke which will bypass the RAM test.  It checks somewhere around 10-20 MB per second, so it takes the better part of a minute before it will boot anything (CD or HDD).
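
For the record, making the floppy was nothing exotic.  Something like the following is the general shape of it (a sketch only; the boot image and the flash utility/BIOS file names here are placeholders, not necessarily what I actually used):

    # Write a bootable FreeDOS image to the floppy (boot144.img is a placeholder name)
    dd if=boot144.img of=/dev/fd0 bs=1024 conv=fsync

    # Mount it and drop the vendor's flash utility and BIOS image onto it
    mkdir -p /mnt/floppy
    mount /dev/fd0 /mnt/floppy
    cp iflash.exe bios.bio /mnt/floppy/   # placeholder file names
    umount /mnt/floppy

Then it is just a matter of booting the target machine from the floppy and running the flash utility from the DOS prompt.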

After getting a working Xubuntu 12.04 set up, I tried doing the in-place 14.04 upgrade.  That worked kind of OK, except the X server would not run (the log showed it was segfaulting).  MmmmmKay, maybe something did not go well during the upgrade, so let's try a clean 14.04 installation.  That had to be done with the minimal CD, because the full image no longer fits on a CD-R and would need USB or DVD, which in turn means installing almost everything over the Internet.  This computer is so old it doesn't boot from USB, and I don't really have a spare DVD drive, so over the Internet it was.  Unfortunately, it had the same result: the Xorg server would not stay running.
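
(In case you're wondering how I knew it was the X server dying: the log makes it fairly obvious.  Roughly what I was looking at, give or take the exact log path on that box:)

    # Look for the crash in the X server log
    grep -nE 'Segmentation fault|Backtrace|\(EE\)' /var/log/Xorg.0.log

    # And confirm the server really isn't staying up
    ps aux | grep '[X]org'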

On one reboot while trying some fixes, the boot just froze.  I found out this was because it was not getting a DHCP address during network initialization.  So I came up from my basement to find my Internet router (a P-II 350) locked solid.  Fortunately it rebooted OK.  That prompted me to get a rudimentary daemon going to drive the watchdog timer card I had installed a few months ago, after my previous router went splat.
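
The "rudimentary daemon" really is rudimentary.  Assuming the card shows up as a standard /dev/watchdog device (mine is a bit more home-brew, so take this as a sketch of the idea rather than my exact code), petting it periodically is all that's required:

    #!/bin/sh
    # Keep the watchdog timer from expiring and yanking the reset line.
    exec 3> /dev/watchdog        # hold the device open for the life of the daemon
    while true; do
        echo . >&3               # any write "pets" a standard watchdog device
        sleep 10                 # comfortably inside the card's timeout
    done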

After getting home from my volleyball match last night, I found I could still log onto the router, but it was off the Internet.  I rebooted it, and I was back online.  I might have been able to get away with ifdown eth3 && ifup eth3, but I didn't think of it at the time.  I also reinstated the command to send an email when booting is complete.
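
That "email when booting is complete" command is just a one-liner near the end of the boot scripts.  A sketch of the idea, assuming a working local mail command (the address is a placeholder):

    # e.g. at the end of /etc/rc.local or an equivalent init script
    echo "rootin finished booting at $(date)" | mail -s "rootin is up" me@example.com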

I awoke this morning to see that sometime after 3am it had been rebooted, no doubt because the watchdog timer card had tagged the reset line.  In retrospect, that is when the system gets really busy with the nightly job that reindexes all pathnames for mlocate, and the daemon presumably got starved long enough for the timer to expire.  I have since adjusted my daemon to call nice(-19) to give it just about the highest userland priority.
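
(If the daemon were launched from a shell or init script instead of calling nice() itself, the equivalent would be roughly the following; the daemon's path and name are placeholders, and a negative niceness requires root:)

    # Start the watchdog petter at a very high userland priority
    nice -n -19 /usr/local/sbin/watchdog-petter &

    # Or bump an already-running instance
    renice -n -19 -p "$(pidof watchdog-petter)"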

I had been watching the latest EEVblog Fundamentals Friday on BJTs and MOSFETs when I tried to leave a comment, and YouTube's little "loading the comments" icon would not go away and show me the comments (or the box to write another one).  I found out the router was still routing with whatever it had in RAM, but it was spewing oodles of disk access errors on the console.  Presumably it needed something on disk in order to complete DNS recursion or some such.  I couldn't even log onto the router.  I just had to "let it ride."  That immediately made me very nervous, because so much of what I have relies on that router functioning: some public DNS zones, email, Google Voice VoIP, routing between my VLANs, DHCP, Hurricane Electric's 6in4 tunnel/radvd, and on and on.  The worst of it is that static IPv4 addressing is horrendously expensive (Verizon charges $20/month more for it on FiOS than for a DHCP account), and while TWC's leases run a week, Verizon's are a short TWO HOURS.  So let's just say there are a whole lot of little headaches strewn throughout the Internet which require attention whenever my IPv4 address changes, and being unreachable for more than two hours could add insult to injury.
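
One small mitigation I keep meaning to formalize, given leases that short, is a watcher on the router that notices when the WAN address changes and nags me so I can start working through the headaches right away.  A sketch of the idea (eth3 is my guess at the WAN interface; the state file path and address are placeholder assumptions):

    #!/bin/sh
    # Compare the current WAN IPv4 address against the last one seen.
    CUR=$(ip -4 addr show dev eth3 | awk '/inet /{print $2}' | cut -d/ -f1)
    LAST=$(cat /var/lib/wan-ip.last 2>/dev/null)
    if [ -n "$CUR" ] && [ "$CUR" != "$LAST" ]; then
        echo "WAN address changed: $LAST -> $CUR" | mail -s "IPv4 changed" me@example.com
        echo "$CUR" > /var/lib/wan-ip.last
    fi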

Needless to say, instead of looking forward to some YouTube watching and Google+ reading, the character of the day changed radically.  It was "beat the clock" time: find a replacement computer to use for the router, install sufficient RAM and an HDD in it, and restore something bootable from backups.  There was no easy way to see whether dhclient was continuing to renew the lease for "my" IPv4 address (it would be trying to write a renewal notice to the syslog, which was likely failing badly).  My nerves were frazzled, my hands shaking.  I kept thinking: follow the attitude, stay as calm as possible under the circumstances, just work the problem one step at a time as each step arises.
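
(When the disk is healthy, the way to check on the lease is the lease file dhclient maintains.  The exact path varies by release, but something along these lines shows the renew/expire times of the most recent lease:)

    grep -E 'renew|rebind|expire' /var/lib/dhcp/dhclient.leases | tail -n 3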

Thinking that I might have to replace the whole computer, I screwed a spare 20GB HDD into the replacement.  Partway through the process, I decided it was better to at least try keeping the existing router, removing its current HDD, and substituting one rewritten from backups (great, I thought: wasted time while trying to get back online).  So I booted an Ubuntu 12.04 Rescue Remix CD, laid out partitions, formatted them, mounted them up into one neat tree under /mnt/rootin ("rootin" is the router's name), used rsync to copy from the backup USB disk onto this new disk (which took about 30 minutes), and ran grub-install to make the disk bootable.  On reboot, the kernel panicked because it could not find init.  Reading back a little further in the kernel messages, the root filesystem could not be mounted because that particular kernel could not handle the inode size chosen by the mke2fs on the Rescue Remix.  ARGHH!!  That was the better part of an hour basically wasted.
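
The mismatch is easy to check after the fact, and in hindsight the Rescue Remix's mke2fs could simply have been told to use the old 128-byte inodes up front (its newer default is larger, which is what the old kernel choked on).  Roughly, with /dev/sda1 standing in for whatever the root partition really is:

    # See what inode size a filesystem was created with
    dumpe2fs -h /dev/sda1 | grep -i 'inode size'

    # Create an ext3 filesystem with 128-byte inodes that an old kernel can mount
    mke2fs -j -I 128 /dev/sda1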

So I dug out the CD I had used to build the router initially and booted it into rescue mode.  I ran its mke2fs all over again (wiping out my restore).  Then: reboot to the Rescue Remix, rsync, grub-install, reboot.  This time it worked OK, at least in single-user mode.  Things were looking up for the first time in two hours or so.
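
For my own future reference, the sequence that finally worked boils down to roughly the following.  The device names, the single root partition, and the /mnt/backup mount point are illustrative assumptions, and I actually ran the mke2fs step from the old installer's rescue mode and grub-install straight from the Rescue Remix rather than via a chroot, but this is the shape of it:

    # From a rescue environment, with the backup USB disk mounted at /mnt/backup
    mke2fs -j -I 128 /dev/sda1                     # old-style 128-byte inodes
    mkdir -p /mnt/rootin
    mount /dev/sda1 /mnt/rootin
    rsync -aH /mnt/backup/rootin/ /mnt/rootin/     # preserve owners, perms, hardlinks

    # One way to make it bootable: bind mounts, chroot, reinstall GRUB
    for fs in dev proc sys; do mount --bind /$fs /mnt/rootin/$fs; done
    chroot /mnt/rootin grub-install /dev/sda
    chroot /mnt/rootin update-grub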

To add to my already frazzled nerves, when I tried to switch from one computer to another with my KVM switch, my CRT would only display black.  Humf.  Suspecting the KVM switch was at fault because it had been powered on for so long, I switched it off and on again...no effect.  For positions where I expected power-saving mode, the monitor's LED was orange; for those where I expected a picture, it was green, but still no picture on the tube.  I do have another CRT in the basement, but it's a 19" whereas the wonky one is a 22", so quite a visual advantage to lose.  Thankfully it wasn't totally dead: I power-cycled the monitor, and it was better.

I decided I would give the original router one more try, though.  I tried Ctrl-Alt-Del on its console.  That did not go well: it immediately started spewing more disk access errors (could not write this, bad sector read that, a technician's horror show).  As this kernel has neither APM nor ACPI support, hitting the power button brought it down HARD.  When I turned it back on, I expected POST would not even recognize the disk.  Surprisingly though, it booted OK.

But here are the things I'm thinking about as this incident winds down.  One, I wish I did not get so worked up about these technical failures.  For me, email would stop (but it would presumably queue up at its sources), and a bunch of conveniences would be inaccessible.  I can't seem to put it into perspective.  Wouldn't it be a lot more terrible if I were in parts of TX (floods) or Syria (ISIL)?  Two, at least now I have a disk which I can fairly easily put into the existing router should its disk decide to go splat for good (such as no longer even passing POST).  Three, at least I have a fairly complete checklist for IPv4 address changes; I just have to calm down and execute each step.  Four, I have some new practical recovery experience for my environment.  In theory, that SHOULD help calm my nerves...but I can't seem to shake that feeling of dread when things go wrong.  I know it's no fun being in the middle of an outage, but I wish I could calm down when this sort of thing happens.  Heck...I get nervous at the thought of just rebooting ANY of the computers in my environment.  I would guess what tweaks me the most is not knowing what the effort will be to restore normal operation.

I think what I really need is a complete migration plan as much away from in-home solutions as I can manage.  That way when stuff fails at home, there is less to lose.  But that's going to cost at least some money, for example for a VPS somewhere.  Sigh.


Direct all comments to Google+, preferably under the post about this blog entry.

English is a difficult enough language to interpret correctly when its rules are followed, let alone when the speaker or writer chooses not to follow those rules.

"Jeopardy!" replies and randomcaps really suck!

Please join one of the fastest growing social networks, Google+!