19 October, 2012

Google Profile: Bragging Rights: Hardware Hacking

When you're a member of Google Plus (G+), there are a lot of things you can fill in about yourself.  If I recall correctly, the only ones which are mandatory are a name and gender (and even disclosure of gender can be controlled).  However, there are lots more categories available, such as education, employment history, and so on.  The one I'd like to focus on here is "bragging rights," filling in some detail which is not on G+.

Way back in the day, my friend Karl had an IBM PC-AT work-alike, with an 80286 processor, a 1.2MB 5.25 inch floppy, and a 20MB Seagate MFM hard disk.[1]  Karl ran mostly PC-DOS (or was it MS-?  not sure), and DESQview, and all sorts of neat hacks like that for quite a while.  But he thought he would take a chance on a recently published OS, MINIX.  That was great; he let me have an account on that and mess with it while I was visiting.  But it had quite an issue.  There was no way MINIX (or even something like a home directory) was going to fit on 17 disk sectors, which is all the hard disk would accept before erroring out.  That, if you old timers will recall, was merely one track (on one side of one platter), not even a whole cylinder.  So, at least for a while, Karl and I had to be content with operating off floppies and a RAM disk.  But it kind of bothered me...how could they even publish a disk driver if it worked so poorly?

It just so happens his father was a computer consultant type, and as such, had access to or copies of all sorts of technical references, one of which was the programmers' manual for the hard disk controller.  So, I went about studying the controller registers and such, and trying to deduce what the disk code in MINIX was doing.

Lo and behold, there was a certain disk controller register which held head number, cylinder number...y'know, that sort of thing, to read or write.  In essence, it embodied how to cross track and cylinder boundaries, and it was divided bitwise.  When I looked at the driver code, it all seemed to jibe fairly well with the tech manual.  But what's this I see?  The various "coordinates" are being assembled together in the assignment of a variable, with various "&" masks, and shifting, and whatnot, but there is a "||" in the code...a Boolean (logical) OR.  Wait a minute...this register needs this in these bits, and that in those other bits...hey, that Boolean OR should be a bitwise OR, or "|" instead!  Jiminy, if we didn't edit the disk driver source, recompile the kernel, and all of a sudden more than 17 hard disk sectors became accessible!  Success!  I hacked the hard disk driver!
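To make that kind of bug concrete, here is a minimal C sketch.  It is not the MINIX source; the register layout (drive select in bit 4, head number in bits 0-3) and the names pack_buggy and pack_fixed are assumptions for illustration only.

```c
#include <stdint.h>

/* Hypothetical layout: bit 4 selects the drive, bits 0-3 select the head. */
#define DRIVE_SHIFT 4
#define DRIVE_MASK  0x01u
#define HEAD_MASK   0x0Fu

/* Buggy: logical OR yields only 0 or 1, so the field values are lost. */
uint8_t pack_buggy(unsigned drive, unsigned head)
{
    return (uint8_t)(((drive & DRIVE_MASK) << DRIVE_SHIFT) || (head & HEAD_MASK));
}

/* Fixed: bitwise OR merges each field into its own bit positions. */
uint8_t pack_fixed(unsigned drive, unsigned head)
{
    return (uint8_t)(((drive & DRIVE_MASK) << DRIVE_SHIFT) | (head & HEAD_MASK));
}
```

With these made-up fields, pack_buggy(0, 3) comes out as 1 while pack_fixed(0, 3) comes out as 3, so the controller would be handed the wrong head.  Exactly how the real driver's mistake limited it to one track depends on which fields it was combining; the sketch only shows how "||" throws the bits away.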

Of course, this was just as a hobby then, although I was studying for a computer science degree at the time.  I actually had a chance a few years after that to do hardware hacking professionally.  My job was to write some pieces which would facilitate voicemail system vendors' integration with the Simplified Message Desk Interface (SMDI) protocol.  This involved writing a character device driver for MS-DOS which would take in data from a serial port and store it for retrieval by the voicemail software.  One of the major lessons learned there was that sometimes manuals "lie."

In the interrupt routine for the serial port, I would read the status register which contains bits for control lines changing (say, losing carrier detect), for character written, for character received, and whether there were any more interrupts pending.  On some systems, I would wonder why data stopped being received from the phone switch, which made my routines reject the SMDI data being sent (format errors, basically).  It turns out the "interrupt processing done" flag bit was being set, even though the "character received/ready" bit was set also!  It was a definite sign that either some of the UARTs were not interrupting enough, or interrupts were being lost.  I stopped relying on the "done" bit, and checked each bit individually.  It seemed less efficient to have to do that, but every once in a long while we come across this need to work around hardware bugs.  I also ended up hooking the timer interrupt, so periodically there was a double check for reception of characters.  At 1200 bps (or 120 characters/second), and timer ticks a little more frequent than 18 Hz, it was enough to make almost all the reception errors go away.
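The difference between trusting the "done" bit and checking each bit on its own can be sketched in C roughly like this.  The bit positions and function names are made up for illustration; they do not correspond to a real UART or to the driver I wrote back then.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bits, for illustration only. */
#define ST_DONE     0x01u  /* "no more interrupts pending" */
#define ST_RX_READY 0x02u  /* a received character is waiting */

/* Trusting version: gives up as soon as the hardware claims it is done. */
bool rx_needed_trusting(uint8_t status)
{
    if (status & ST_DONE)
        return false;
    return (status & ST_RX_READY) != 0;
}

/* Defensive version: looks at the receive bit regardless of "done". */
bool rx_needed_defensive(uint8_t status)
{
    return (status & ST_RX_READY) != 0;
}
```

When both bits are set at once, the trusting version skips the waiting character and the defensive one picks it up.  As for the timer backstop: at 120 characters per second and roughly 18.2 ticks per second, no more than about seven characters can arrive between two timer checks.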

I've also had a long-term interest in Linux internals.  Really, the build scaffolding for the Linux kernel has become very refined over the years, making it very easy for anyone with a minimum of system skills to tinker with the kernel.  I used to want to optimize the booting time by going through the configuration and including only the bare minimum to get init going, which meant basically console, disk, and root filesystem drivers, and everything else possible as a module.  From there, anything needed by init or beyond could be demand-loaded by the kernel merely by triggering a module load through accessing a device node.  (Of course these days there are facilities to enumerate all the hardware present, such as scanning PCI and USB buses, and everything gets loaded and initialized, kinda whether you wanted it initialized or not.)  But I needed even a little bit more control of my kernel than that (no pun intended).

I had bought a serial card for my computer: four ports, two populated with hardware, and sockets for two more sets of DIP chips.  It could also be clocked at four times the standard rate, making it capable of 460800 bps.  The thing about x86 Linux at the time is that it had no accommodation in the serial driver for rates greater than 115200.  Oddly enough, the system header files defined constants for the termios interfaces for B230400 and B460800, but they were never used in the serial port driver.  Well...since I wanted to drive my US Robotics Courier at the top speed of which it was capable, 230400 bps, I put some additional code in there to handle the higher, nonstandard rates.  Build, install, run the bootloader installer, reboot, and wouldn't you know...it actually worked swimmingly.  I was thrilled to be a minor, successful Linux kernel hacker.
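For the curious, the arithmetic behind those rates looks something like the following C sketch.  This is not the patch I made to the kernel; it assumes the usual PC-style UART arrangement where the chip divides a base rate by a whole-number divisor, and the function name divisor_for is made up.

```c
/*
 * baud_base is the fastest rate the port can produce (divisor 1):
 * 115200 on a standard PC port, 460800 on a card clocked four times faster.
 * Returns the divisor to program, or 0 if the rate cannot be produced.
 */
unsigned divisor_for(unsigned long baud_base, unsigned long baud)
{
    if (baud == 0 || baud > baud_base || baud_base % baud != 0)
        return 0;
    return (unsigned)(baud_base / baud);
}
```

On a standard port, divisor_for(115200, 230400) returns 0 because there is no divisor smaller than 1.  With the faster clock, divisor_for(460800, 230400) returns 2, which is why the card could reach rates the stock driver never offered.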

In yet another incarnation of my main Linux system, I couldn't use that ISA 4 port card (that machine has only PCI slots), so I got a Keyspan USB to four RS232C port adapter.  These are also capable of more than the standard 115200 bps, and again, I wanted my Courier to operate optimally, so I decided to sling it off the USB adapter.  And that was great...except for one problem.  When I would dial into my system, all the lines in my session would "pile up" at the last line of my terminal (emulator).  "That's odd," I thought.  If I attached the modem to the standard, chipset serial port, things were just fine, so I knew it wasn't my kernel per se.  I was able to track it down, in that the terminal driver checks each outbound character for "\n", and if the right output termios bits are set, it inserts an additional "\r".  I traced this all through the terminal driver down to the kernel, and back up through the USB serial driver and the Keyspan driver.  None of it made sense.
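The output translation in question is what termios calls ONLCR (map NL to CR-NL on output).  A stand-alone C sketch of the idea, not the kernel's actual code, might look like this; the function name expand_onlcr and the buffer-based interface are assumptions for illustration.

```c
#include <stddef.h>

/*
 * Copy in[0..in_len) to out, writing "\r\n" for every '\n'.
 * Stops early if out would overflow; returns the number of bytes written.
 */
size_t expand_onlcr(const char *in, size_t in_len, char *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == '\n') {
            if (n + 2 > out_cap)
                break;
            out[n++] = '\r';   /* the extra character being inserted */
            out[n++] = '\n';
        } else {
            if (n + 1 > out_cap)
                break;
            out[n++] = in[i];
        }
    }
    return n;
}
```

Note that the output can be longer than the input, so whatever sits below this step has to cope with an extra byte appearing in the stream.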

I actually posted my findings of this misbehavior to a newsgroup, asking what would be the wisest way to fix that.  The reply was kind of disappointing.  In essence, it was, yep, we know it does that, and the reason is the packetized nature of USB transfers.  Adding that additional character ("\r") presents a timing problem in that the character(s) preceding it have already been "in flight" down the USB, so it would be difficult with the existing architecture to accommodate that (atomically) in the output.  I can say that although I wasn't able to fix it, at least I had the satisfaction of knowing I was able to pinpoint the exact cause of the misbehavior, and just lacked the knowledge of USB to handle it adequately.  Ultimately, the times I would be dialing in for just the terminal mode would be really rare (I would much more often dial in to use PPP instead), so it really wasn't worth the effort to fix it.  It was simply easier to adjust the terminal program to add an LF for each received CR for a call to my system.

[1] For you young'uns, this was in the days when the mainboard had lots of Industry Standard Architecture (ISA) slots, and very little was on the mainboard.  Your parallel printer and serial ports were on one card, your graphics controller on another, your floppy controller on another, and your hard disk controller on another.  This had the advantage of functional isolation--if your floppy controller went bad, you could swap out just that; these days, it's all wrapped up into the mainboard chipset, and the whole mainboard has to be replaced.  It also gave the system builder the ability to engineer the system better.  For example, if the standard two serial ports were not enough, just get a card with say eight ports instead.  But I digress; the point is, hard disks were relatively expensive back then, and most systems had only floppies.
