Re: clock stops, U5/10 Solaris 8

From: Brian Utterback (Brian.Utterback_at_Sun.removeme.COM)
Date: 07/14/03


Date: Mon, 14 Jul 2003 11:13:13 -0400
To: Andy Lennard <andy@kontron.demon.co.uk>

It sounds like the hardware TOD clock is stuck. The next time this happens try running
the following script:

adb -k /dev/ksyms /dev/mem <<EOF
clock_adj_hist/4E
adj_hist_entry/D
lbolt/D
lbolt64/E
EOF

If the values in the array "clock_adj_hist" are relatively close together (less than 5000
apart), then the problem is very likely a stuck TOD clock.
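
As a rough aid for reading the adb output, the check could be sketched in Python as below. This is only an illustration: the function name, the sample values, and my reading of "close together" as the gap between neighbouring sorted values are all assumptions, not something adb or the kernel provides.

```python
# Hypothetical helper: decide whether the four clock_adj_hist values
# printed by adb are "close together" in the sense described above.
THRESHOLD = 5000  # rule-of-thumb limit quoted in the text


def looks_stuck(adj_hist, threshold=THRESHOLD):
    """Return True if every gap between neighbouring sorted values is below threshold."""
    values = sorted(adj_hist)
    gaps = [b - a for a, b in zip(values, values[1:])]
    # With fewer than two values there is nothing to compare, so report False.
    return bool(gaps) and all(gap < threshold for gap in gaps)


# Made-up sample values, not real adb output.
print(looks_stuck([120300, 120500, 120700, 120900]))    # gaps of 200
print(looks_stuck([120300, 520300, 920300, 1320300]))   # gaps of 400000
```

You would type in the four numbers from the first line of the adb output by hand; the result is only a hint, not a diagnosis.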

You can set the parameter "tod_validate_enable" to enable hardware TOD validation. Just
add this line to /etc/system:

set tod_validate_enable = 1

You asked for a description of the interaction of the hardware TOD clock and the in-kernel
software clock. Okay, here goes:

As you surmised, the hardware clock is used at boot time to set the initial value of the
software clock. However, this is not the last time it is used. It is also used to double
check the value of the software clock. To understand this, you need to understand the
way the clock is implemented in the kernel.

The software clock is simply a counter that is incremented each "tick". The tick is driven by
a programmable oscillator, generally programmed to raise an interrupt every 100th of
a second. In the interrupt handling routine, all of the periodic system maintenance
occurs, including incrementing the clock counter.

There is a problem with this setup, however. The tick is not the highest level interrupt.
If a higher priority interrupt is being serviced when the tick fires, the tick is masked.
Unlike other interrupts, the tick does not queue, it is just dropped. Thus the kernel clock
will run slow, but not in a uniform way.
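
To make the effect of dropped ticks concrete, here is a toy model in Python. It is not kernel code; the function name and the idea of passing in the set of masked tick numbers are assumptions made purely for illustration.

```python
TICKS_PER_SECOND = 100  # the tick usually fires every 100th of a second


def software_clock_seconds(total_ticks, masked):
    """Return the software clock reading after total_ticks oscillator firings.

    Ticks whose index is in `masked` arrive while a higher priority
    interrupt is being serviced, so they are dropped rather than queued.
    """
    counter = 0
    for i in range(total_ticks):
        if i in masked:
            continue  # tick lost for good
        counter += 1
    return counter / TICKS_PER_SECOND


# Ten real seconds with no masking, then with every tenth tick masked.
print(software_clock_seconds(1000, set()))
print(software_clock_seconds(1000, set(range(0, 1000, 10))))
```

In the second case the software clock has advanced only nine seconds while ten real seconds went by. On a real system the masked ticks depend on interrupt load, which is why the slowdown is not uniform.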

To counteract this, the kernel compares the software clock and the hardware clock to see if
they match. If they don't, then the software clock is reset to match the hardware clock,
since the hardware clock does not have the same problem of losing ticks.

Whenever the software clock is "set" by an external means such as the settimeofday call, the
value of the hardware clock is reset to match the software clock.

This whole process is further complicated by the fact that the hardware clock has a one second
resolution and the software clock has a nanosecond resolution. This simple fact is at the heart
of many of the bugs encountered by this setup. Since the minimum possible difference is one
second, we would like to fix the software clock when the two clocks are apart by one second.
But because of the discrete nature of the hardware clock, it is possible to detect a one second
difference when they really are only one nanosecond apart, but not detect any difference if they
differ by as much as 1.999999 seconds. So, to be sure that they are at least one second apart,
we look for a numeric difference of at least 2 seconds. Thus, any Solaris system, just sitting
there, idle, might experience a 2 second clock jump forwards or backwards at any time. This is
the symptom of the bug that Paul mentioned earlier. However, it is not the jump itself that is
the bug, but the frequency of the jumps.
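
The resolution problem can be illustrated with a small Python sketch. It assumes, purely for illustration, that both clocks are truncated to whole seconds before being compared; the actual kernel comparison may differ, and the function names are hypothetical.

```python
NS_PER_SEC = 1_000_000_000


def apparent_difference(sw_ns, hw_ns):
    """Difference in whole seconds when each clock is truncated to seconds."""
    return sw_ns // NS_PER_SEC - hw_ns // NS_PER_SEC


def should_reset(sw_ns, hw_ns):
    """Apply the rule from the text: act only on a numeric difference of 2 or more."""
    return abs(apparent_difference(sw_ns, hw_ns)) >= 2


# Really 1 ns apart, yet the truncated values differ by 1.
print(apparent_difference(10 * NS_PER_SEC, 10 * NS_PER_SEC - 1))
# Really 1.999999999 s apart, and the truncated values also differ by only 1.
print(apparent_difference(10 * NS_PER_SEC + 999_999_999, 9 * NS_PER_SEC))
# A truncated difference of 2 means the real gap is more than one second.
print(should_reset(10 * NS_PER_SEC, 8 * NS_PER_SEC + 1))
```

Under this model a numeric difference of 1 tells you almost nothing about the real gap, while a numeric difference of 2 guarantees the clocks are more than a second apart, which is why the threshold is 2.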

This is all made even worse with the advent of the SunFire line of systems. These systems only maintain
a single hardware clock for all of the domains, with an "offset" stored for each domain. Keeping
all of the possible discrete transitions and reading and writing straight led to a number of bugs
in this code.

However, there is hope in sight. As of Solaris 8, the tick handling routines were changed to
use kernel cyclic timers instead of an interrupt handler. Kernel cyclic timers are not subject
to losing "ticks". So, starting in Solaris 8, the kernel clock should no longer run slow. It
may or may not keep better time than the hardware clock, but they should be on a par now, with
no reason to prefer one over the other.

Andy Lennard wrote:
> In message <3F11F23E.E9A333AC@ntlworld.com>, Dr. David Kirkby
> <drkirkby@ntlworld.com> writes
>
>>Jim Prescott wrote:
>>
>>>[crossposted comp.unix.solaris,comp.sys.sun.hardware]
>>>
>>>A couple times a month the clock on one of our systems stops. Actually
>>>it gets stuck in a 3 second loop. Eg:
>>> Thu Jul 10 08:21:10 EDT 2003
>>> Thu Jul 10 08:21:08 EDT 2003
>>> Thu Jul 10 08:21:09 EDT 2003
>>
>>>We were all set to just write it off as a broken machine when suddenly
>>
>>If a patch does not cure it (as others have suggested it might), you
>>may think it worth the time/effort to replace the NVRAM chip. It
>>depends on how much you value your time (excuse the pun), but that is
>>the hardware device that keeps the time. They cost little and are in
>>sockets that make them easy to remove and re-program. If you do this,
>>take a note of the mac address and hostid before doing the swap.
>>There's a good FAQ on the web on the nvram chip.
>>
>>I've nothing to confirm that chip would solve the problem, but it is
>>by far the most likely cause IF it's a hardware problem.
>
>
> That's interesting. I'd always assumed that current time was held within
> 'the kernel' somewhere, and that the NVRAM was only read at boot time,
> and written to maintain consistency. Now oddly interested, does anyone
> have any pointers to a description of how the OS interacts with the
> NVRAM clock timer? Thanks.
>

-- 
blu
Brian's 12th rule of support: Supporting any technology
           that has something called an "oid", will hurt.
--------------------------------------------------------------------------------
Brian Utterback - Solaris Sustaining (NFS/Naming) - Sun Microsystems Inc.,
Ph/VM: 781-442-1343, Em:brian.utterback-at-ess-you-enn-dot-kom

