01 January, 2012

One Week's Adventure into Getting Accurate Time

Unfortunately, for a while at the company where I work, they had a falseticker in their ensemble of  NTP sources, one which was "broadcasting" a time in mid-October (for a real, actual time of late December).  What this seemed to do is slow down the apparent time of the PBX (it was behind), and occasionally our PBX NTP source would go into the "unsynchronized" state (stratum 16).  This would cause the PBX to slow down its idea of the progression of time, losing on the order of a second or more every hour.  Overnight, the time displayed on desktop phones would be 20 or more seconds behind.

Avaya PBXes are based on Red Hat Enterprise Linux these days, complete with ntpd.  Once the kernel clock had been slowed down, the LOCAL reference clock of course would be slow.  And due to our main NTP source (which we've been told is a router's address) going to stratum 16, the selection algorithm would often choose LOCAL.  What's somewhat worse is that the pair of systems, the active and the standby, are in a peer relationship.  When one of them chooses the LOCAL reference clock, often the other will choose its peer rather than the server (router) source.  Admittedly, the source was occasionally at stratum 16, but I wasn't understanding how a stratum 10 or 11 source could be chosen in preference to one at least most of the time claiming to be stratum 2.  The reachability bitmask from the ntpq output didn't look good either. I had just assumed that the report of stratum 16 was due to ntpd marking it so due to lack of good connectivity (that is to say, packets being dropped or something).

It's also true that the Windows domain under which we operate would be radically off correct time as well.  I can only imagine that the AD controllers likewise were having trouble believing the time the router was handing out, and therefore also started to drift either according to the kernel PLL/FLL or their hardware (CMOS) clock(s).

Eventually I noticed that instead of the router being at stratum 16 for a few minutes, it was staying there for over an hour and a half. It's then I started paying closer attention to the runs of ntpdate that I had doing to judge the offset between our NTP source and the PBX.  I had assumed the router was being marked stratum 16 because of dropped packets, but "ntpdate  -d" shows the statistics right then and there, independent of everything else.  If an NTP source itself (such as that router) is declaring itself as stratum 16, ntpdate will show that directly, not due to some unreachabiltiy notion.  So I wrote an email to the Help Desk stating we could be having NTP troubles.

The network admin who would be responsible for that NTP source sits on the other side of a cubicle wall from me.  He likewise was concerned about connectivity/dropped packets, and frankly, I wasn't 100% convinced at the time the stratum 16 was coming solely as a result of the router declaring itself so; I was still maybe 5% unsure and that it could be, as he suggested, some packet drop problem.  So as he asked, I started some pings from our (Solaris) CMS system and the PBX.  When one was perfect (no missed packets at all) and the other missed only 2 out of 500 (negligible), he spent some time looking further.  He found the falseticker and removed it from the configuration.

Eventually, the routers were getting only good NTP times.  As is the very careful way of NTP, it took a few hours, but eventually the address we were told to use for NTP was not only synchronized, but also reported fewer and fewer times, and then no time at all at stratum 16.

The challenge now was how to get the active PBX to synchronize.  I was pretty sure it would not do so automatically because of the large, 25 second-ish difference between its LOCAL reference time and the router's (correct) reference time.  Remember, both the active server and its standby were slowed way down from being told to do so for days.  We really didn't want to step the time directly on the active, as this would mess up statistics being collected continuously (about ongoing calls, queue times, etc.).  So what I did in hopes of righting the ship so to speak is step the standby.  This in theory would be less risky, as it's doing what: just standing by!  The hope was to get the active to slew its clock forward, and eventually catch up.

Well...not so much.  Unfortunately, stepping the time in Avaya Communication Manager (CM) is kind of clumsey.  You authenticate to a Web server, and you submit a form whose resolution is only a minute.  What'd really be preferable is to be able to log on as the superuser ("root"), kill ntpd, use ntpdate to step the clock to something really, really close, and restart ntpd.  Not so with CM; it's not really easy to become superuser.  Sure, they have a PBX admin group called "superuser," but this isn't like POSIX-familiar people refer to that term.  CM has a very fine grained setup of sudo so that one can "sudo sat" for example to start the System Administration Terminal to run sat with the apparently requisite superuser (in the POSIX sense) permissions.

So, I was hoping being in that NTP peer relationship would have some weight when it came to keeping the active server at the correct time.  So I stepped the time best I could with the Web interface to about within a half second or so of actual time.  I thought, this is great.  In short order, everything will be back on track.

Not so.

What really happened is that due to the (in)frequency at which ntpd polls its sources, the LOCAL reference clock on the standby slipped to around a second and a half behind the router (now good) reference time.  So it still chose its peer, and stepped it further behind, back to 25 or 30 behind actual.  I thought, OK, I can do this; in watching the secondary (standby), I saw it "almost" choose the router's time because the LOCAL clock and it were almost the same after 10 or so polls.  But it had slid before a total "lock."  So the trick here must be a little anticipation, stepping the clock a COUPLE OF seconds ahead, so that by the time it "slid" backwards, it would be close enough to lock on.  It was a matter of sort of "leading the target."  So once more I stepped the clock with the Web interface, except a couple seconds "early."

That, as it turns out, worked fairly well.  Faster than I ever expected otherwise, the secondary locked on, and "persuaded" the primary to step its time forward, and also restart the NTP algorithms, thus polling more frequently for a while.  Unfortunately, it did not cause a slew as I had hoped, but at least by the time the standby's ntpd caused the primary's ntpd to step the time, it was "after hours" Eastern Time, therefore somewhat less disruptive.

Eventually, a few days later, it looks like all three of the router, active, and standby times are not only accurate but proceeding at the correct rate. It looked for a while to drop out of sync now and again, but over the long haul once the ticking rates got adjusted, everything seems to be humming along normally.   The FLL has transitioned back to PLL I'm pretty certain.

The lesson of the story is, sometimes you have to be pretty good at anticipation in order to hone in on a target.

English is a difficult enough language to interpret correctly when its rules are followed, let alone when the speaker or writer chooses not to follow those rules.

"Jeopardy!" replies and randomcaps really suck!

Please join me on Google+!