06 March, 2017

Xorg Voodoo, and It Really Pays to Practice Your Backups and Restores

I wish Xorg/Wayland/Weston was not so much black magic voodoo juju.

For some experimentation and hacking fun, I installed gdm3 on my Xubuntu desktop system.  I set up gdm as my default DM (dpkg-reconfigure lightdm and select gdm from the list), and then I ran systemctl stop lightdm and systemctl start gdm.  That was somewhat of a visual shock, because I had never run gdm3 before, but nonetheless, it was usable.  I logged in as my normal user.

I had some "normal" logins, where GDM started up my "normal" Xfce session.  Then I decided, I wanted to see if the "Weston" option worked, as it had not under LightDM.  Shazam, whatever GDM does that LightDM doesn't, I don't know, but that entry worked.  Likewise I fiddled a little with "GNOME on Wayland", which was interesting.  It's the first time I've ever used (I think it's called) the Lens.  Meh.  It's OK I guess, but I miss my menus of applications and such.  I don't like the Lens so much.

One of the first things I noticed is, "log out" was not part of the dialog like it was under Xfwm/LightDM.  There was only poweroff, reboot, and suspend.  Huh?  That seems kind of weird.  Eventually, I found out (don't remember where) that there was a separate logout option.  Then I hopped on over to tty1, and did systemctl stop gdm (might have had a 3 too).  Then systemctl start gdm.  Wow, that's really weird.  GDM didn't start, but it looked like several times per second, it was trying.  It was even difficult to type systemctl stop gdm because a couple of times a second, input was being stolen by the process trying to start GDM (or Xorg, not sure which).  In fact, I don't know what the deuce was going on, but I could start neither gdm nor lightdm.

At this point I reasoned, I had seen the systemd file for lightdm and remembered it had a test for the default display manager.  I would have figured dpkg-reconfigure would just use systemctl enable and systemctl disable because it "knows" the list of DMs, but it writes /etc/X11/default-display-manager anyway.  Okeydokey, I did anoher dpkg-reconfigure lightdm and selected lightdm.  That still wouldn't start either.  Well, neither would gdm, so I rebooted.

The first surprise came when LightDM started really soon.  I had set up lightdm to be disabled, because I want all the stuff which happens at boot to settle down first, then start the display server.  I do this in rc.local by backgrounding a shell script which sleeps 20 seconds then does the appropriate thing for the service.  It used to use an appropriate Upstart command, but of course when upgrading Xubuntu LTS 14 to 16, it had to be updated to use systemctl(8) instead.  But it seems the dpkg-reconfigure had undone any enable or disable, since I had not even gotten the prompt on tty1 before the VT was changed to tty7 to start the X server.  Meh, OK, I recognized this and just disabled lightdm.

I had fun experimenting with starting Weston (like loading different modules in the [core] section).  One thing that didn't work too well was using drm-backend.so for Weston.  That not only killed Weston, but also whacked Xorg too.  That got a little whacky in that I had problems after that switching VTs.  I had to log into another host on my network, SSH to the workstation, and systemctl reboot.  After all, a computer isn't particularly useful if you can't type at it, if that's the way you normally give input to it.

I got tired of going through dpkg-reconfigure to switch DMs, so I just edited /etc/X11/default-display-manager directly.  That seemed to be OK, but eventually, I got to a point where GDM wouldn't start, and LightDM wouldn't start either.  Huh, that's weird.  So I restarted the whole system.

Then there was the chilling realization that systemctl start lightdm did not do a whole lot except throw errors I could not understand in to the systemctl status lightdm and journalctl outputs, like stuff about some assertion failing.  I'm sure if I wanted to take the time to download the ENTIRE SOURCE package for Xorg, I might see what that assertion does, and why its failure was happening, but I was not about to take all that time to futz around with that. What I thought might have helped is, I have an Xorg "prestart" script which sets the screen saver timeout and DPMS, changes the root window background color so that I know Xorg is running but before LightDM can initialize, and use some xrandr commands to set up the resolutions and refresh rates of the two framebuffers/monitors.  (Xorg cannot read EDID information because the switches through which both monitors are connected mangle EDID, so it uses defaults...and that's just really ugly.)  While I was writing that prestart script, I redirected stderr to an unused tty.  All I got on that VT was messages about "can't open display."  In retrospect, what I really should have peeked at was /var/log/Xorg.0.log for clues, but it would not necessarily have revealed anything I could understand.

I tried another dpkg-reconfigure to make sure whatever needs to be done to switch DMs is done, figuring it might be more than just rewriting the /etc/X11/default-display-manager file.  That, unfortunately, was no help whatsoever.  Restarting the system did not help either.  I remove/purge'd the gdm3 package; no help.  I reinstalled the lightdm package; that wasn't any help either.  Sigh.  It was going to be a really bad day if the only thing which is going to get my dailly driver back is a Xubuntu reinstallation and reconfiguration.  At least the vast, vast majority of my personal settings and data is on a separate /home logical volume.  I could very likely keep all the logical volumes, filesystems (but remade filesystems, except for /home of course) and stuff, so it wouldn't be like a blank disk installation.  I have to imagine there will unfortunately be a somewhat large portion of *buntu users where that would be their only option because they're just not that experienced or learned in operations at this level.  Most folks don't need it because their systems just work, they get their work done, and the amount of experimentation, especially at the system level, is minimal.

Next I did something I do very rarely, which is select the entry for system recovery at system boot.  I figured I needed as little as possible running for what I was about to do next.  Ugh.  That is really ugly because of the nomodeset option.  I am really, really used to the VTs coming up 1920x1080 (or 240x67).  So, I restarted and edited the default entry instead, adding "single" to the end of the kernel command line.  I figured pretty much all the configuration is held in /etc, so I figured out which disk and logical volume I used for backups last (which was right around midnight Sunday, started it up and went to bed) and mounted it.  Then I did rsync -av --delete /mnt/bkup/thishost/etc/. /etc/. to get the /etc directory back to how it was.  That went really quickly, as you can imagine.  Then I just hit Ctrl-Alt-Del.  That's of course going to umount the LV on the USB disk, deactivate all the USB LVs, everything buttoned up and ready to restart.

Except that didn't help either.  I even tried unplugging my computer for a while figuring it was some really weird juju with how the video controller was being initialized..hoping letting the capacitors discharge would unstick this lack of Xorg starting.  No, as I could have predicted, that really wasn't it either.

The semi-weird thing is, while logged in as the superuser on tty1, I could run Xorg :0 just fine.  Of course, that's not particularly useful, but at least it proved it was not a hardware denial, or corrupted driver .so'es, or something like that.  The X server itself would start, it's just that lightdm couldn't start it and use it.  Well...come to think of it, the screen was initialized to all black, not the gray dot pattern it usually does, and no big X cursor appeared.  Not sure what was up with that.  At least it didn't go, as it sometimes will when it's failing, to VT 7 and do nothing but leave the blinking text mode cursor there.

I was getting really discouraged (and a little panicked to be honest) at this point.  I thought it was going to be hours before being up and running again.  I was starting to think of, how am I going to fetch the ISO to do another installation?  Can I get one effectively with one of my other systems, likely with Lynx?  I mean, as IT disasters go, this is pretty mild because at least there is a "known way out" (namelly OS reinstallation) which is nearly guaranteed to get the blasted thing working again.  It's just the thought of the long, long time it was going to take to make that happen, with all the work that would need to be done in terms of installing the packages I like which basically has to happen after the standard installation was finished.  It could be a lot worse; it could be the CPU itself which doesn't work, and I'd have to go back to a LOT slower machine (from a Core 2 Duo to a Pentium IV).

Sigh.  OK, I wasn't too sure about using my complete backup.  I do a number of --exclude=  directives when I do the backups.  But I'm never quite sure if I am excepting enough.  For example, it'd probably be less than a good result if the LVM information was overwritten (so actually, that's already excluded).  And sometimes the presence of files can make a difference, so of course you're going to have to use the --delete directive.  I'm thinking, if this obliterates the wrong things, it's going to be a long, arduous reinstallation process, but hey, it's at least worth a try to do a full restore.  After all, like YouTuber AvE often says, if it's broken, how can it hurt a whole lot to break it some more?  Worst thing that happens is, my restore methodology overwrites zeroes over everything, and I have to reinstall everything anyway.  Surely it will take not a whole lot of time to MUNG things to the point where OS reinstallation becomes a certainty.

So with some trepidation, in single user mode again, I mounted up the last backup, but I was still unsure of what I was about to mangle, so I added the --dry-run option to rsync.  And boy am I glad I did.  When you go about deleting things like lost+found, and bad things™ happen, even worse things tend to happen when fsck is trying to set things right and it can't write to lost+found because it's not there.  It's also not particularly useful to go mucking about in /sys or /proc.  I definitely didn't want to get into a loop trying to do untoward things with /mnt/bkup/thishost so I knew enough to mount the backup read-only, but still figured out what I really wanted to do is exclude everything under /mnt.  I also chose to exclude everything under my $HOME but it would still be possible that some of the session files under there could screw with logging in under Xfce (or who knows, one time I got auto-logged into Weston when I didn't mean to, it must have stuck as the last thing I tried in the greeter).

So eventually I settled on a pretty significant set of --exclude's and let it rip.  As I had been experimenting with --dry-run a number of times, there already was significant information in the block cache that really, it was only a few minutes later that rsync said it was finished.  I restarted, unplugged the backup disk's PSU while the BIOS screen was showing (yep, it's that old, not UEFI), and let GrUB do its thing.  And...


I killed my little delayed DM starter script, did systemctl start lightdm, and the system once again looked normal.  Of course, since the system is on a conventional SATA disk (not an SSD), it took agonizingly long to initialize, but I knew things were likely going to work OK because I got my normal prompt from ssh-agent to enter in the passphrase for my private keys in an XTerm.

What I'd really, really like to know is, what caused LightDM not to be able to start Xorg?  That's the voodoo juju part of all this.  You'd really hope that something particularly helpful woud be in the journalctl output, or systemctl status.  But alas, no help was forthcoming.  These days, if you don't have a working graphics environment where you can run a browser with JavaScript capabilities, lamentably you're at quite a disadvantage in researching possible causes and remedies.  The usual copying/pasting of an error message into a Google search is going to be quite difficult.

Long and the short of it is, it's really a particularly good idea to practice restores every now and again.  It will point out deficiencies in either your backup or your restore methodology, or maybe both.  In any case, with such practice, it shoud speed up recovery from being in a jam.

Direct all comments to Google+, preferably under the post about this blog entry.

English is a difficult enough language to interpret correctly when its rules are followed, let alone when the speaker or writer chooses not to follow those rules.

"Jeopardy!" replies and randomcaps really suck!

Please join one of the fastest growing social networks, Google+!