06 March, 2017

Xorg Voodoo, and It Really Pays to Practice Your Backups and Restores

I wish Xorg/Wayland/Weston were not so much black magic voodoo juju.

For some experimentation and hacking fun, I installed gdm3 on my Xubuntu desktop system.  I set up gdm as my default DM (dpkg-reconfigure lightdm and select gdm from the list), and then I ran systemctl stop lightdm and systemctl start gdm.  That was somewhat of a visual shock, because I had never run gdm3 before, but nonetheless, it was usable.  I logged in as my normal user.

I had some "normal" logins, where GDM started up my "normal" Xfce session.  Then I decided, I wanted to see if the "Weston" option worked, as it had not under LightDM.  Shazam, whatever GDM does that LightDM doesn't, I don't know, but that entry worked.  Likewise I fiddled a little with "GNOME on Wayland", which was interesting.  It's the first time I've ever used (I think it's called) the Lens.  Meh.  It's OK I guess, but I miss my menus of applications and such.  I don't like the Lens so much.

One of the first things I noticed is, "log out" was not part of the dialog like it was under Xfwm/LightDM.  There was only poweroff, reboot, and suspend.  Huh?  That seems kind of weird.  Eventually, I found out (don't remember where) that there was a separate logout option.  Then I hopped on over to tty1, and did systemctl stop gdm (might have had a 3 too).  Then systemctl start gdm.  Wow, that's really weird.  GDM didn't start, but it looked like several times per second, it was trying.  It was even difficult to type systemctl stop gdm because a couple of times a second, input was being stolen by the process trying to start GDM (or Xorg, not sure which).  In fact, I don't know what the deuce was going on, but I could start neither gdm nor lightdm.

At this point I reasoned, I had seen the systemd file for lightdm and remembered it had a test for the default display manager.  I would have figured dpkg-reconfigure would just use systemctl enable and systemctl disable because it "knows" the list of DMs, but it writes /etc/X11/default-display-manager anyway.  Okeydokey, I did another dpkg-reconfigure lightdm and selected lightdm.  That still wouldn't start either.  Well, neither would gdm, so I rebooted.
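
For anyone wanting to see which DM the system currently considers the default, here is a minimal sketch.  It assumes /etc/X11/default-display-manager holds a single absolute path (such as /usr/sbin/lightdm), and that the test in the unit file presumably boils down to comparing the basename of that path.  The function name default_dm is made up for illustration.

```shell
#!/bin/sh
# Sketch: report which display manager Debian/Ubuntu considers the default.
# Assumption: the file holds one absolute path, e.g. /usr/sbin/lightdm.

# default_dm [FILE] - print the basename of the configured display manager.
default_dm() {
    file=${1:-/etc/X11/default-display-manager}
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Read only the first line (even without a trailing newline),
    # then strip the directory part.
    IFS= read -r path < "$file" || [ -n "$path" ]
    basename "$path"
}

# Usage (needs the real file to exist):
#   default_dm                      # e.g. prints "lightdm"
#   systemctl is-enabled lightdm    # separately, check the systemd side
```

Checking both the file and systemctl is-enabled is the point here, since the two can evidently disagree after switching DMs by hand.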

The first surprise came when LightDM started really soon.  I had set up lightdm to be disabled, because I want all the stuff which happens at boot to settle down first, then start the display server.  I do this in rc.local by backgrounding a shell script which sleeps 20 seconds then does the appropriate thing for the service.  It used to use an appropriate Upstart command, but of course when upgrading Xubuntu LTS 14 to 16, it had to be updated to use systemctl(8) instead.  But it seems the dpkg-reconfigure had undone any enable or disable, since I had not even gotten the prompt on tty1 before the VT was changed to tty7 to start the X server.  Meh, OK, I recognized this and just disabled lightdm.
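
A delayed starter of the kind described above could look roughly like this.  It is a sketch, not my actual script: the function name delayed_start and the script path in the comments are hypothetical, and the 20 seconds is simply the figure mentioned above.

```shell
#!/bin/sh
# Sketch of a delayed display-manager starter, meant to be backgrounded
# from rc.local. The delay and the command are illustrative.

# delayed_start DELAY CMD [ARGS...] - wait DELAY seconds, then run CMD.
delayed_start() {
    delay=$1
    shift
    sleep "$delay"
    "$@"
}

# In /etc/rc.local (hypothetical script path), something like:
#   /usr/local/sbin/delayed-dm.sh &
# where the script ends with:
#   delayed_start 20 systemctl start lightdm
```

This only behaves as intended while the DM's unit stays disabled (systemctl disable lightdm), which is exactly what the dpkg-reconfigure appeared to undo.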

I had fun experimenting with starting Weston (like loading different modules in the [core] section).  One thing that didn't work too well was using drm-backend.so for Weston.  That not only killed Weston, but also whacked Xorg too.  That got a little whacky in that I had problems after that switching VTs.  I had to log into another host on my network, SSH to the workstation, and systemctl reboot.  After all, a computer isn't particularly useful if you can't type at it, if that's the way you normally give input to it.

I got tired of going through dpkg-reconfigure to switch DMs, so I just edited /etc/X11/default-display-manager directly.  That seemed to be OK, but eventually, I got to a point where GDM wouldn't start, and LightDM wouldn't start either.  Huh, that's weird.  So I restarted the whole system.

Then there was the chilling realization that systemctl start lightdm did not do a whole lot except throw errors I could not understand into the systemctl status lightdm and journalctl outputs, like stuff about some assertion failing.  I'm sure if I wanted to take the time to download the ENTIRE SOURCE package for Xorg, I might see what that assertion does, and why its failure was happening, but I was not about to take all that time to futz around with that.  What I thought might have helped is, I have an Xorg "prestart" script which sets the screen saver timeout and DPMS, changes the root window background color so that I know Xorg is running but before LightDM can initialize, and uses some xrandr commands to set up the resolutions and refresh rates of the two framebuffers/monitors.  (Xorg cannot read EDID information because the switches through which both monitors are connected mangle EDID, so it uses defaults...and that's just really ugly.)  While I was writing that prestart script, I redirected stderr to an unused tty.  All I got on that VT was messages about "can't open display."  In retrospect, what I really should have peeked at was /var/log/Xorg.0.log for clues, but it would not necessarily have revealed anything I could understand.
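
To give an idea of what such a prestart script does, here is a rough sketch.  Everything specific in it is an assumption for illustration: the output names DVI-0 and DVI-1, the timeouts, the color, and the mode.  It prints the commands rather than running them, so the list can be looked over before being piped to sh.  With mangled EDID the 1920x1080 mode may not be offered at all, in which case it would first have to be defined with xrandr --newmode and --addmode (not shown).

```shell
#!/bin/sh
# Sketch of an Xorg "prestart" helper. Output names, timeouts, colour and
# mode are assumptions; list real output names with `xrandr --query`.

# prestart_cmds [DISPLAY] - print the commands instead of running them,
# so the list can be reviewed first and then piped to sh.
prestart_cmds() {
    dpy=${1:-:0}
    cat <<EOF
xset -display $dpy s 600
xset -display $dpy dpms 600 900 1200
xsetroot -display $dpy -solid '#203040'
xrandr --display $dpy --output DVI-0 --mode 1920x1080 --rate 60
xrandr --display $dpy --output DVI-1 --mode 1920x1080 --rate 60 --right-of DVI-0
EOF
}

# Usage (hypothetical log path):
#   prestart_cmds :0 | sh 2>>/var/log/xorg-prestart.log
```

A "can't open display" message from commands like these generally means the X server is not accepting connections yet, or the script lacks authorization for it, so it says little about why the server itself failed.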

I tried another dpkg-reconfigure to make sure whatever needs to be done to switch DMs is done, figuring it might be more than just rewriting the /etc/X11/default-display-manager file.  That, unfortunately, was no help whatsoever.  Restarting the system did not help either.  I remove/purge'd the gdm3 package; no help.  I reinstalled the lightdm package; that wasn't any help either.  Sigh.  It was going to be a really bad day if the only thing which is going to get my daily driver back is a Xubuntu reinstallation and reconfiguration.  At least the vast, vast majority of my personal settings and data is on a separate /home logical volume.  I could very likely keep all the logical volumes, filesystems (but remade filesystems, except for /home of course) and stuff, so it wouldn't be like a blank disk installation.  I have to imagine there will unfortunately be a somewhat large portion of *buntu users where that would be their only option because they're just not that experienced or learned in operations at this level.  Most folks don't need it because their systems just work, they get their work done, and the amount of experimentation, especially at the system level, is minimal.

Next I did something I do very rarely, which is select the entry for system recovery at system boot.  I figured I needed as little as possible running for what I was about to do next.  Ugh.  That is really ugly because of the nomodeset option.  I am really, really used to the VTs coming up 1920x1080 (or 240x67).  So, I restarted and edited the default entry instead, adding "single" to the end of the kernel command line.  I figured pretty much all the configuration is held in /etc, so I figured out which disk and logical volume I used for backups last (which was right around midnight Sunday, started it up and went to bed) and mounted it.  Then I did rsync -av --delete /mnt/bkup/thishost/etc/. /etc/. to get the /etc directory back to how it was.  That went really quickly, as you can imagine.  Then I just hit Ctrl-Alt-Del.  That's of course going to umount the LV on the USB disk, deactivate all the USB LVs, everything buttoned up and ready to restart.

Except that didn't help either.  I even tried unplugging my computer for a while, figuring it was some really weird juju with how the video controller was being initialized, and hoping that letting the capacitors discharge would unstick this lack of Xorg starting.  No, as I could have predicted, that really wasn't it either.

The semi-weird thing is, while logged in as the superuser on tty1, I could run Xorg :0 just fine.  Of course, that's not particularly useful, but at least it proved it was not a hardware denial, or corrupted driver .so'es, or something like that.  The X server itself would start, it's just that lightdm couldn't start it and use it.  Well...come to think of it, the screen was initialized to all black, not the gray dot pattern it usually does, and no big X cursor appeared.  Not sure what was up with that.  At least it didn't go, as it sometimes will when it's failing, to VT 7 and do nothing but leave the blinking text mode cursor there.

I was getting really discouraged (and a little panicked to be honest) at this point.  I thought it was going to be hours before being up and running again.  I was starting to think of, how am I going to fetch the ISO to do another installation?  Can I get one effectively with one of my other systems, likely with Lynx?  I mean, as IT disasters go, this is pretty mild because at least there is a "known way out" (namely OS reinstallation) which is nearly guaranteed to get the blasted thing working again.  It's just the thought of the long, long time it was going to take to make that happen, with all the work that would need to be done in terms of installing the packages I like which basically has to happen after the standard installation was finished.  It could be a lot worse; it could be the CPU itself which doesn't work, and I'd have to go back to a LOT slower machine (from a Core 2 Duo to a Pentium IV).

Sigh.  OK, I wasn't too sure about using my complete backup.  I do a number of --exclude=  directives when I do the backups.  But I'm never quite sure if I am excepting enough.  For example, it'd probably be less than a good result if the LVM information was overwritten (so actually, that's already excluded).  And sometimes the presence of files can make a difference, so of course you're going to have to use the --delete directive.  I'm thinking, if this obliterates the wrong things, it's going to be a long, arduous reinstallation process, but hey, it's at least worth a try to do a full restore.  After all, like YouTuber AvE often says, if it's broken, how can it hurt a whole lot to break it some more?  Worst thing that happens is, my restore methodology overwrites zeroes over everything, and I have to reinstall everything anyway.  Surely it will take not a whole lot of time to MUNG things to the point where OS reinstallation becomes a certainty.

So with some trepidation, in single user mode again, I mounted up the last backup, but I was still unsure of what I was about to mangle, so I added the --dry-run option to rsync.  And boy am I glad I did.  When you go about deleting things like lost+found, and bad things™ happen, even worse things tend to happen when fsck is trying to set things right and it can't write to lost+found because it's not there.  It's also not particularly useful to go mucking about in /sys or /proc.  I definitely didn't want to get into a loop trying to do untoward things with /mnt/bkup/thishost so I knew enough to mount the backup read-only, but still figured out what I really wanted to do is exclude everything under /mnt.  I also chose to exclude everything under my $HOME but it would still be possible that some of the session files under there could screw with logging in under Xfce (or who knows, one time I got auto-logged into Weston when I didn't mean to, it must have stuck as the last thing I tried in the greeter).
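
For illustration, an exclude list along the lines discussed above might be put together like this.  The patterns are assumptions, not the actual list I ended up with, and the function name restore_excludes and the file /tmp/restore.excl are made up.  Note that with plain --delete, rsync leaves excluded paths on the receiving side alone, which is what keeps things like lost+found from being removed.

```shell
#!/bin/sh
# Sketch: an exclude list for a full-system restore with rsync.
# The patterns are assumptions, not the actual list used in the post.

# restore_excludes - print one rsync exclude pattern per line.
# A leading "/" anchors the pattern at the root of the transfer.
restore_excludes() {
    cat <<'EOF'
/proc/*
/sys/*
/dev/*
/run/*
/tmp/*
/mnt/*
/home/*
lost+found
EOF
}

# Usage (backup path from the post; try --dry-run first, then drop it):
#   restore_excludes > /tmp/restore.excl
#   rsync -av --delete --dry-run --exclude-from=/tmp/restore.excl \
#       /mnt/bkup/thishost/. /.
```

Keeping the patterns in a file makes it easy to rerun the same --dry-run several times while adjusting the list, and then run the real thing with exactly what was last reviewed.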

So eventually I settled on a pretty significant set of --exclude's and let it rip.  As I had been experimenting with --dry-run a number of times, there already was significant information in the block cache that really, it was only a few minutes later that rsync said it was finished.  I restarted, unplugged the backup disk's PSU while the BIOS screen was showing (yep, it's that old, not UEFI), and let GrUB do its thing.  And...


I killed my little delayed DM starter script, did systemctl start lightdm, and the system once again looked normal.  Of course, since the system is on a conventional SATA disk (not an SSD), it took agonizingly long to initialize, but I knew things were likely going to work OK because I got my normal prompt from ssh-agent to enter in the passphrase for my private keys in an XTerm.

What I'd really, really like to know is, what caused LightDM not to be able to start Xorg?  That's the voodoo juju part of all this.  You'd really hope that something particularly helpful would be in the journalctl output, or systemctl status.  But alas, no help was forthcoming.  These days, if you don't have a working graphics environment where you can run a browser with JavaScript capabilities, lamentably you're at quite a disadvantage in researching possible causes and remedies.  The usual copying/pasting of an error message into a Google search is going to be quite difficult.

Long and the short of it is, it's really a particularly good idea to practice restores every now and again.  It will point out deficiencies in either your backup or your restore methodology, or maybe both.  In any case, with such practice, it should speed up recovery from being in a jam.

Direct all comments to Google+, preferably under the post about this blog entry.

English is a difficult enough language to interpret correctly when its rules are followed, let alone when the speaker or writer chooses not to follow those rules.

"Jeopardy!" replies and randomcaps really suck!

Please join one of the fastest growing social networks, Google+!

02 March, 2017

Yesterday, I knocked myself off the Internet

Yesterday, I spent some time poking about the Actiontec MI424WR (Rev I) that Verizon supposedly provided "for free" as an incentive to subscribe to FiOS.  Supposedly, in order to be (fully) supported, you need a Verizon approved router, including one of these.  I'm not sure their site allows one to complete the online (service) order without agreeing to rent, purchase, or otherwise prove you have (or will have) one of Verizon's routers.  You may be able to finish the order these days, but as I recall, two years ago when I was ordering, their maze of forms and JavaScript wouldn't allow a submission without one.  Anyhow...I read a recent thread on DSL Reports about residential class accounts being able to have static IP addresses (Verizon doesn't allow that), and the workaround of using dynamic DNS services prompted me to start poking around to see which services (dyndns.org, noip.com, etc.) the Verizon router supports directly.

I have VLANs set up on my switch, one for TWC/Spectrum WAN (although I don't subscribe to any of their services presently), one for my VOIP LAN, one for most of the rest of my LAN, one for FiOS WAN, and one for the FiOS LAN.  The nexus for everything is a PC running Linux functioning as a router.  I knew there was the possibility of an address "conflict" if I plugged in the WAN port on the Actiontec (because Verizon only allows one DHCP lease at a time) so initially I powered up the Actiontec with the WAN cable unplugged.

After puttering about with a lot of its settings (ugh, I hate the Actiontec Web interface), I'm not sure what possessed me, but I thought, hey, my Linux router has a DHCP lease, and since Verizon's systems will only allow one lease at a time, I thought plugging in the Actiontec WAN cable should be no problem.  If it tries to obtain a lease it will just be denied, whether by DHCPNAK or just timing out.

Emmmm....wrong!  Very shortly after plugging in the WAN cable, the "Internet" LED came on.  First I thought, "wait, what?"  That was shortly followed by "oh, crap!"  Sure enough, I logged onto my "production" router, tried the usual "ping", and there were no replies whatsoever.  There isn't anything of consequence connected to the LAN ports of the Actiontec; it was pretty much just connected so that I could get in to configure it, and possibly switch things up a bit if a Verizon TSR demanded to have it online.  So either the lease which Linux had obtained was somehow "transferred" to the Actiontec, or the lease Linux had was invalidated, and at any rate, in that state the Linux router was of no (WAN) use.  (It still routed just fine between all the LANs.)  I basically knocked myself off the Internet, because nothing on my network is set up to operate through the Actiontec.  I thought, you idiot, you should have logged onto the switch and issued "shutdown" to the interface for the Actiontec WAN port first.

As you may gather from some of my previous postings, here on the I Heart Libertarianism blog or on Google+, I get pretty anxious about not having Internet connectivity, so to lift a line from Dickens, this was not the best of times.  I think this is mostly because I have the family's email server here, not to mention virtually all the important notifications I have would go to a philipps.us or joe.philipps.us address.  It's also the DNS master for a number of my domains, including philipps.us.  I know, I know...the TTLs on the SOA records themselves should make them valid for two weeks, so even without Internet for an extended-ish time, things should not fall apart entirely.

Email servers very typically keep retrying for several days, maybe even as much as a week, so that should not be so terrible.  As a further mitigation of any failure of my email server here, it just so happens I was one of the people who got in on the "ground floor" when Google was beta testing Google Apps (the Web services, not the usual meaning these days of the apps to access Google on Android). As a consequence, I have a "no cost" G Suite configuration as a less preferred MX.  Therefore, it would be somewhat messy from an email history standpoint, but a catchall account on G Suite would have any email which my setup cannot suck in.  Still...I think it's the thought that without Internet, even that backup setup is no good because I can't get to it.  I would have to "borrow" someone else's Internet access even to see what's over at my G Suite account.

This would be compounded by the fact that these days, many of my access passwords are utter gibberish, thanks to KeePass and KeePassX.  The database is in my Google Drive, but also backed up on my local computer.  The implication is, it's another one of those "bootstrap" problems, without Internet, I don't have access to the master KeePass database, and even if I work from the copy, say from a computer at the Erie County library, it's going to be a LOT of tedious typing because the library's computers are likely not going to be able to run the KeePass software.  I'd be working with revealing the decrypted passwords on KeePassDroid on my Nexus 7.  For any Web services which will accept it, I turn on basically everything printable except space for the KeePass generator, and typically 20 characters.  So yeah....lots of tedious typing if I have to use another computer.

Despite the minor panic I was in, I thought, come on, this shouldn't be that difficult, you really should have a way out of this.  You can try ifdown on the Internet interface (happens to be eth3), followed by ifup.  Nope, that didn't really do anything.  Just calm down a little, and work the problem.  If you get back on the Actiontec, you should be able to pick and prod your way around it, and find the "release DHCP lease" button, which you know is in there somewhere.  That would at least mollify Verizon's backend(s) (or the ONT) into letting Linux get a usable address again.  That was in fact the key.  After hitting "release" on the Actiontec, I was able to ifdown/ifup one more time, and Linux got an address/lease.  However...it was not the IPv4 address I had before.  Rats.

As mentioned, the whole exercise started with wondering about Actiontec's implementation of dynamic DNS.  This is precisely what I needed to do.  This happens so infrequently that I have a Google Keep checklist for IPv4 address changes (which I have exported to a Google Doc for linking in this blog entry).  I have that accessible on my Nexus 7, so all I have to do is find it on there, and I'm good to go.  I copied the list, renamed it with "1-Mar-2017" in the title, and went about executing its items.

There were items on the checklist that I still had to figure out on the spot.  For example, for some of the items, I did not know the pathnames of what needed changing, or what item in the relevant file.  So in a sense, it's good this happened, because it has made me refine the process and therefore improve it.  Still, it's a pain whenever my address changes.  Some of it could probably be scripted or automated, but it's one of those things that happens so infrequently, I have to wonder how much utility there is in writing anything.
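
As a first small step toward automating it, something like the following could at least notice that the address has changed.  It is only a sketch: eth3 is the interface mentioned above, while the function names and the state file path are made up for illustration, and the checklist items themselves are not covered.

```shell
#!/bin/sh
# Sketch: notice when the WAN IPv4 address changes.
# eth3 is the interface named in the post; the state file is hypothetical.

# ipv4_of - read `ip -4 -o addr show dev IFACE` output on stdin and print
# the first address without its prefix length.
ipv4_of() {
    awk '$3 == "inet" { sub(/\/.*/, "", $4); print $4; exit }'
}

# addr_changed STATEFILE CURRENT - succeed (exit 0) if CURRENT differs from
# the address saved in STATEFILE, and record the new one.
addr_changed() {
    state=$1
    current=$2
    previous=$(cat "$state" 2>/dev/null || true)
    [ "$current" = "$previous" ] && return 1
    printf '%s\n' "$current" > "$state"
    return 0
}

# Usage:
#   cur=$(ip -4 -o addr show dev eth3 | ipv4_of)
#   addr_changed /var/lib/wan-addr "$cur" && echo "WAN address is now $cur"
```

Run from cron or a DHCP client hook, the echo could be swapped for whatever notification is handy, leaving the checklist as the manual part.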

Anyhow...obviously, I'm back online, or I couldn't be posting this.  Hopefully I'll be better prepared for the next time my address changes.
