Precision Time and Frequency Synchronization Using Modified Kernels 1. Introduction This memo describes replacements for certain SunOS and Ultrix kernel routines that manage the system clock and timer functions. They provide improved accuracy and stability through the use of a disciplined clock interface for use with the Network Time Protocol (NTP) or similar time- synchronization protocol. In addition, for certain models of the DECstation 5000 product line, the new routines provide improved precision to +-1 microsecond (us) (SunOS 4.1.1 already does provide precision to +-1 us). The current public NTP distribution cooperates with these kernel routines to provide synchronization in principle to within a microsecond, but in practice this is limited by the short-term stability of the oscillator that drives the timer interrupt. This memo describes the principles behind the design and operation of the software. There are two versions of the software, one that operates with the SunOS 4.1.1 kernel and the other that operates with the Ultrix 4.2a kernel (and probably the 4.3 kernel, although this has not been tested). A detailed description of the variables and algorithms is given in the hope that similar improvements can be incorporated in Unix kernels for other machines. The software itself is not included in this memo, since it involves licensed code. Detailed instructions on where to obtain it for either SunOS or Ultrix will be given separately. The principle function added to the SunOS and Ultrix kernels is to change the way the system clock is controlled, in order to provide precision time and frequency adjustments. Another function utilizes an undocumented counter in the DECstation hardware to provide precise time to the microsecond. This function can be used only with the DECstation 5000/240 and possibly others that use the same input/output chipset. 2. Design Principles In order to understand how these routines work, it is useful to consider how most Unix systems maintain the system clock. In the original design a hardware timer interrupts the kernel at some fixed rate, such as 100 Hz in the SunOS kernel and 256 Hz in the Ultrix kernel. Since 256 does not evenly divide the second in microseconds, the kernel inserts 64 us once each second so that the system clock stays in step with real time. The time returned by the gettimeofday() routine is thus characterized by 255 advances of 3906 us plus one of 3970 us. Also in the original design it is possible to slew the system clock to a new offset using the adjtime() system call. To do this the clock frequency is changed by adding or subtracting a fixed amount (tickadj) at each timer interrupt (tick) for a calculated number of ticks. Since this calculation involves dividing the requested offset by tickadj, it is possible to slew to a new offset with a precision only of tickadj, which is usually in the neighborhood of 5 us, but sometimes much higher. In order to maintain the system clock within specified bounds with this scheme, it is necessary to call adjtime() on a regular basis. For instance, let the bound be set at 100 us, which is a reasonable value for NTP-synchronized hosts on a local network, and let the onboard oscillator tolerance be 100 ppm, which is a reasonably conservative assumption. This requires that adjtime() be called at intervals not exceeding 1 second (s), which is in fact what the unmodified NTP software daemon does. In the modified kernel routines this scheme is replaced by another that extends the low-order bits of the system clock to provide very precise clock adjustments. At each timer interrupt a precisely calibrated time adjustment is added to the composite time value and overflows handled as required. The quantity to add is computed from the adjtime() call and, in addition a frequency adjustment, which is automatically calculated from previous time adjustments. This implementation operates as an adaptive-parameter, first-order, type-II, phase-lock loop (PLL), which in principle provides precision control of the system clock phase to within +-1 us and frequency to within +-5 nanoseconds (ns) per day. This PLL model is identical to the one implemented in NTP, except that in NTP the software daemon has to simulate the PLL using only the original adjtime() system call. The daemon is considerably complicated by the need to parcel time adjustments at frequent intervals in order to maintain the accuracy to specified bounds. The kernel routines do this directly, allowing vast gobs of ugly daemon code to be avoided at the expense of only a small amount of new code in the kernel. In fact, the amount of code added to the kernel for the new scheme is about the amount removed for the old scheme. The new adjtime() routine needs to be called only as each new time update is determined, which in NTP occurs at intervals of from 64 s to 1024 s. In addition, doing the frequency correction in the kernel means that the system time runs true even if the daemon were to cease operation or the network paths to the primary reference source fail. Note that the degree to which the adjtime() adjustment can be made is limited to a specific maximum value, presently +-128 milliseconds (ms), in order to achieve microsecond resolution. It is the intent in the design that settimeofday() be used for changes in system time greater than +-128 ms. It has been the Internet experience that the need to change the system time in increments greater than +-128 milliseconds is extremely rare and is usually associated with a hardware or software malfunction. Nevertheless, the limit applies to each adjtime() call and it is possible, but not recommended, that this routine is called at intervals smaller than 64 seconds, which is the NTP lower limit. For the most accurate and stable operation, adjtime() should be called at specified intervals; however, the PLL is quite forgiving and neither moderate loss of updates nor variations in the length of the interval is serious. The current engineering parameters have been optimized for intervals not greater than about 64 s. For larger intervals the PLL time constant can be adjusted to optimize the dynamic response up to intervals of 1024 s. Normally, this is automatically done by NTP. In any case, if updates are suspended, the PLL coasts at the frequency last determinated, which usually results in errors increasing only to a few tens of milliseconds over a day. The new code needs to know the initial frequency offset and time constant for the PLL, and the daemon needs to know the current frequency offset computed by the kernel for monitoring purposes. This is provided by a small change in the second argument of the kernel adjtime() calling sequence, which is documented later in this memo. Ordinarily, only the daemon will call the adjtime() routine, so the modified calling sequence is easily accommodated. Other than this change, the operation of adjtime() is transparent to the original. In the DECstation 5000/240 and possibly other models there happens to be an undocumented hardware register that counts system bus cycles at a rate of 25 MHz. The new kernel routines test for the CPU type and, in the case of the '240, use this register to interpolate system time between hardware timer interrupts. This results in a precision of +-1 us for all time values obtained via the gettimeofday() system call. This routine calls the kernel routine microtime(), which returns the actual interpolated value, but does not change the kernel time variable. Therefore, other kernel routines that access the kernel time variable directly and do not call either gettimeofday() or microtime() will continue their present behavior. The new kernel routines include provisions for error statistics (maximum error and estimated error), leap seconds and system clock status. These are intended to support applications that need such things; however, there are no applications other than the time-synchronization daemon itself that presently use them. At issue is the manner in which these data can be provided to application clients, such as new system calls and data interfaces. While a proposed interface is described later in this memo, it has not yet been implemented. This is an area for further study. While any time-synchronization daemon can in principle be modified to use the new code, the most likely will be users of the xntp3 distribution of NTP. The code in the xntp3 distribution determines whether the new kernel code is in use and automatically reconfigures as required. When the new code is in use, the daemon reads the frequency offset from a file and provides it and the initial time constant via adjtime(). In subsequent calls to adjtime(), only the time adjustment and time constant are affected. The daemon reads the frequency from the kernel (returned as the second argument of adjtime()) at intervals of one hour and writes it to the file. 3. Technical Description Following is a technical description of how the new scheme works in terms of the variables and algorithms involved. These components are discussed as a distinct entity and do not involve coding details specific to the Ultrix kernel. The algorithms involve only minor changes to the system clock and interval timer routines, but do not in themselves provide a conduit for application programs to learn the system clock status or statistics of the time-synchronization process. In a later section a number of new system calls are proposed to do this, along with an interface specification. The new scheme works like the companion simulator called kern.c and included in this directory. This stand-alone simulator includes code fragments identical to those in the modified kernel routines and operates in the same way. The system clock is implemented in the kernel using a set of variables and algorithms defined below and in the simulator. The algorithms are driven by explicit calls from the synchronization protocol as each time update is computed. The clock is read and set using the gettimeofday() and settimeofday() system calls, which operate in the same way as the originals, but return a status word describing the state of the system clock. Once the system clock has been set, the adjtime() system call is used to provide periodic updates including the time offset and possibly frequency offset and time constant. With NTP this occurs at intervals of from 64 s to 1024 s, deending on the time constant value. The kernel implements an adaptive-parameter, first-order, type-II, phase-lock loop (PLL) in order to integrate this offset into the phase and frequency of the system clock. The kernel keeps track of the time of the last update and adjusts the maximum error to grow by an amount equal to the oscillator frequency tolerance times the elapsed time since the last update. Occasionally, it is necessary to adjust the PLL parameters in response to environmental conditions, such as leap-second warning and oscillator stability observations. While the interface to do this has not yet been implemented, proposals to to that are included in a later section. A system call (setloop()) is used on such occasions to communicate these data. In addition, a system call (getloop())) is used to extract these data from the kernel for monitoring purposes. All programs utilize the system clock status variable time_status, which records whether the clock is synchronized, waiting for a leap second, etc. The value of this variable is returned by each system call. It can be set explicitly by the setloop() system call and implicitly by the settimeofday() system call and in the timer-interrupt routine. Values presently defined in the header file timex.h are as follows: int time_status = TIME_BAD; /* clock synchronization status */ #define TIME_UNS 0 /* unspecified or unknown */ #define TIME_OK 1 /* operation succeeded */ #define TIME_INS 1 /* insert leap second at end of current day */ #define TIME_DEL 2 /* delete leap second at end of current day */ #define TIME_OOP 3 /* leap second in progress */ #define TIME_BAD 4 /* system clock is not synchronized */ #define TIME_ADR -1 /* operation failed: invalid address */ #define TIME_VAL -2 /* operation failed: invalid argument */ #define TIME_PRV -3 /* operation failed: priviledged operation */ In case of a negative result code, the operation has failed; however, some variables may have been modified before the error was detected. Note that the new system calls never return a value of zero, so it is possible to determine whether the old routines or the new ones are in use. The syntax of the modified adjtime() is as follows: /* * adjtime - adjuts system time */ #include <sys/timex.h> int gettimexofday(tp, fiddle) struct timeval *tp; /* system time adjustment*/ struct timeval *fiddle; /* sneak path */ On entry the "timeval" sneak path is coded: struct timeval { long tv_sec = time_constant; /* time constant */ long tv_usec = time_freq; /* new frequency offset */ } However, the sneak is ignored if fiddle is the null pointer and the new frequency offset is ignored if zero. The value returned on exit is the system clock status defined above. The "timeval" sneak path is modified as follows: struct timeval { long tv_sec = time_precision; /* system clock precision */ long tv_usec = time_freq; /* current frequency offset */ } 3.1. Kernel Variables The following variables are used by the new code: long time_offset = 0; /* time adjustment (us) */ This variable is used by the PLL to adjust the system time in small increments. It is scaled by (1 << SHIFT_UPDATE) in binary microseconds. The maximum value that can be represented is about +-130 ms and the minimum value or precision is about one nanosecond. long time_constant = SHIFT_TAU; /* pll time constant */ This variable determines the bandwidth or "stiffness" of the PLL. It is used as a shift, with the effective value in positive powers of two. The optimum value for this variable is equal to 1/64 times the update interval. The default value SHIFT_TAU (0) corresponds to a PLL time constant of about one hour or an update interval of about one minute, which is appropriate for typical uncompensated quartz oscillators used in most computing equipment. Values larger than four are not useful, unless the local clock timebase is derived from a precision oscillator. long time_tolerance = MAXFREQ; /* frequency tolerance (ppm) */ This variable represents the maximum frequency error or tolerance of the particular platform and is a property of the architecture. It is expressed as a positive number greater than zero in parts-per-million (ppm). The default MAXFREQ (100) is appropriate for conventional workstations. long time_precision = 1000000 / HZ; /* clock precision (us) */ This variable represents the maximum error in reading the system clock. It is expressed as a positive number greater than zero in microseconds and is usually based on the number of microseconds between timer interrupts, in the case of the Ultrix kernel, 3906. However, in cases where the time can be interpolated between timer interrupts with microsecond resolution, the precision is specified as 1. This variable is computed by the kernel for use by the time-synchronization daemon, but is otherwise not used by the kernel. struct timeval time_maxerror; /* maximum error */ This variable represents the maximum error, expressed as a Unix timeval, of the system clock. For NTP, it is computed as the synchronization distance, which is equal to one-half the root delay plus the root dispersion. It is increased by a small amount (time_tolerance) each second to reflect the clock frequency tolerance. This variable is computed by the time-synchronization daemon and the kernel for use by the application program, but is otherwise not used by the kernel. struct timeval time_esterror; /* estimated error */ This variable represents the best estimate of the actual error, expressed as a Unix timeval, of the system clock based on its past behavior, together with observations of multiple clocks within the peer group. This variable is computed by the time-synchronization daemon for use by the application program, but is otherwise not used by the kernel. The PLL itself is controlled by the following variables: long time_phase = 0; /* phase offset (scaled us) */ long time_freq = 0; /* frequency offset (scaled ppm) */long time_adj = 0; /* tick adjust (scaled 1 / HZ) */ These variables control the phase increment and the frequency increment of the system clock at each tick of the clock. The time_phase variable is scaled by (1 << SHIFT_SCALE) in binary microseconds, giving a minimum value (time resolution) of 9.3e-10 us. The time_freq variable is scaled by (1 << SHIFT_KF) in parts-per-million (ppm), giving it a maximum value of about +-130 ppm and a minimum value (frequency resolution) of 6e-8 ppm. The time_adj variable is the actual phase increment in scaled microseconds to add to time_phase once each tick. It is computed from time_phase and time_freq once per second. long time_reftime = 0; /* time at last adjustment (s) */ This variable is the second's portion of the system time on the last call to adjtime(). It is used to adjust the time_freq variable as the time since the last update increases. The HZ define establishes the timer interrupt frequency, 256 Hz for the Ultrix kernel and 100 Hz for the SunOS kernel. The SHIFT_HZ define expresses the same value as the nearest power of two in order to avoid hardware multiply operations. These are the only parameters that need to be changed for different timer interrupt rates. #define HZ 256 /* timer interrupt frequency (Hz) */ #define SHIFT_HZ 8 /* log2(HZ) */ The following defines establish the engineering parameters of the PLL model. They are chosen for an initial convergence time of about an hour, an overshoot of about seven percent and a final convergence time of several hours, depending on initial frequency error. #define SHIFT_KG 10 /* shift for phase increment */ #define SHIFT_KF 24 /* shift for frequency increment */ #define SHIFT_TAU 0 /* default time constant (shift) */ The SHIFT_SCALE define establishes the decimal point on the time_phase variable which serves as a an extension to the low-order bits of the system clock variable. The SHIFT_UPDATE define establishes the decimal point of the phase portion of the adjtime() update. The FINEUSEC define represents 1 us in scaled units. #define SHIFT_SCALE 28 /* shift for scale factor */ #define SHIFT_UPDATE 14 /* shift for offset scale factor */ #define FINEUSEC (1 << SHIFT_SCALE) /* 1 us in scaled units */ The FINETUNE define represents the residual, in ppm, to be added to the system clock variable in addition to the integral 1-us value given by tick. This allows a systematic frequency offset in cases where the timer interrupt frequency does not exactly divide the second in microseconds. #define FINETUNE (1000000 - (1000000 / HZ) * HZ) /* frequency adjustment * for non-isochronous HZ (ppm) */ The following four defines establish the performance envelope of the PLL, one to bound the maximum phase error, another to bound the maximum frequency error and the last two to bound the minimum and maximum time between updates. The intent of these bounds is to force the PLL to operate within predefined limits in order to conform to the correctness models assumed by time-synchronization protocols like NTP and DTSS. An excursion which exceeds these bounds is clamped to the bound and operation proceeds accordingly. In practice, this can occur only if something has failed or is operating out of tolerance, but otherwise the PLL continues to operate in a stable mode. Note that the MAXPHASE define conforms to the maximum offset allowed in NTP before the system time is reset, rather than incrementally adjusted. #define MAXPHASE 128000 /* max phase error (us) */ #define MINSEC 64 /* min interval between updates (s) */ #define MAXFREQ 100 /* max frequency error (ppm) */ #define MAXSEC 1024 /* max interval between updates (s) */ 3.2. Code Segments The code segments illustrated in the simulator should make clear the operations at various points in the code. These segments are not derived from any licensed code. The hardupdate() fragment is called by adjtime() to update the system clock phase and frequency. This is an implementation of an adaptive-parameter, first-order, type-II phase-lock loop. Note that the time constant is in units of powers of two, so that multiplies can be done by simple shifts. The phase variable is computed as the offset multiplied by the time constant. Then, the time since the last update is computed and clamped to a maximum (for robustness) and to zero if initializing. The offset is multiplied (sorry about the ugly multiply) by the result and by the square of the time constant and then added to the frequency variable. Finally, the frequency variable is clamped not to exceed the tolerance. Note that all shifts are assumed to be positive and that a shift of a signed quantity to the right requires a litle dance. With the defines given, the maximum time offset is determined by the size in bits of the long type (32) less the SHIFT_UPDATE (14) scale factor or 18 bits (signed). The scale factor is chosen so that there is no loss of significance in later steps, which may involve a right shift up to 14 bits. This results in a maximum offset of about +-130 ms. Since the time_constant must be greater than or equal to zero, the maximum frequency offset is determined by the SHIFT_KF (24) scale factor, or about +-130 ppm. In the addition step the value of offset * mtemp is represented in 18 + 10 = 28 bits, which will not overflow a long add. There could be a loss of precision due to the right shift of up to eight bits, since time_constant is bounded at four. This results in a net worst-case frequency error of about 2^-16 us or well down into the oscillator phase noise. While the time_offset value is assumed checked before entry, the time_phase variable is an accumulator, so is clamped to the tolerance on every call. This helps to damp transients before the oscillator frequency has been determined, as well as to satisfy the correctness assertions if the time-synchronization protocol comes unstuck. The hardclock() fragment is inserted in the hardware timer interrupt routine at the point the system clock is to be incremented. The phase adjustment (time_adj) is added to the clock phase (time_phase) and tested for overflow of the microsecond. If an overflow occurs, the microsecond (tick) in incremented or decremented. The second_overflow() fragment is inserted at the point where the microseconds field of the system time variable is being checked for overflow. On rollover of the second the maximum error is increased by the tolerance. The time offset is divided by the phase weight (SHIFT_KG) and time constant. The time offset is then reduced by the result and the result is scaled and becomes the value of the phase adjustment. The phase adjustment is then corrected for the calculated frequency offset and a fixed offset FINETUNE which is a property of the architecture. On rollover of the day the leap-warning indicator is checked and the apparent time adjusted +-1 s accordingly. The gettimeofday() routine insures that the reported time is always monotonically increasing. The simulator can be used to check the loop operation over the design range of +-128 ms in time error and +-100 ppm in frequency error. This confirms that no overflows occur and that the loop initially converges in about 50-60 minutes for timer interrupt rates from 50 Hz to 1024 Hz. The loop has a normal overshoot of about seven percent and a final convergence time of several hours, depending on the initional frequency error. 3.3. Leap Seconds The leap-warning condition is determined by the synchronization protocol (if remotely synchronized), by the timecode receiver (if available), or by the operator (if awake). The time_status value must be set on the day the leap event is to occur (30 June or 31 December) and is automatically reset after the event. If the value is TIME_DEL, the kernel adds one second to the system time immediately following second 23:59:58 and resets time_status to TIME_OK. If the value is TIME_INS, the kernel subtracts one second from the system time immediately following second 23:59:59 and resets time_status to TIME_OOP, in effect causing system time to repeat second 59. Immediately following the repeated second, the kernel resets time_status to TIME_OK. Depending upon the system call implementation, the reported time during a leap second may repeat (with a return code set to advertise that fact) or be monotonically adjusted until system time "catches up" to reported time. With the latter scheme the reported time will be correct before and after the leap second, but freeze or slowly advance during the leap second itself. However, Most programs will probably use the ctime() library routine to convert from timeval (seconds, microseconds) format to tm format (seconds, minutes,...). If this routine is modified to inspect the return code of the gettimeofday() routine, it could simply report the leap second as second 60. To determine local midnight without fuss, the kernel simply finds the residue of the time.tv_sec value mod 86,400, but this requires a messy divide. Probably a better way to do this is to initialize an auxiliary counter in the settimeofday() routine using an ugly divide and increment the counter at the same time the time.tv_sec is incremented in the timer interrupt routine. For future embellishment. 4. Proposed Application Program Interface Most programs read the system clock using the gettimeofday() system call, which returns the system time and time-zone data. In the modified 5000/240 kernel, the gettimeofday() routine calls the microtime() routine, which interpolates between hardware timer interrupts to a precision of +-1 microsecond. However, the synchronization protocol provides additional information that will be of interest in many applications. For some applications it is necessary to know the maximum error of the reported time due to all causes, including those due to the system clock reading error, oscillator frequency error and accumulated errors due to intervening time servers on the path to a primary reference source. However, for those protocols that adjust the system clock frequency as well as the time offset, the errors expected in actual use will almost always be much less than the maximum error. Therefore, it is useful to report the estimated error, as well as the maximum error. It does not seem useful to provide additional details private to the kernel and synchronization protocol, such as stratum, reference identifier, reference timestamp and so forth. It would in principle be possible for the application to independently evaluate the quality of time and project into the future how long this time might be "valid." However, to do that properly would duplicate the functionality of the synchronization protocol and require knowledge of many mundane details of the platform architecture, such as the tick value, reachability status and related variables. Therefore, the application interface does not reveal anything except the time, timezone and error data. With respect to NTP, the data maintained by the protocol include the roundtrip delay and total dispersion to the source of synchronization. In terms of the above, the maximum error is computed as half the delay plus the dispersion, while the estimated error is equal to the dispersion. These are reported in timeval structures. A new system call is proposed that includes all the data in the gettimeofday() plus the two new timeval structures. The proposed interface involves modifications to the gettimeofday(), settimeofday() and adjtime() system calls, as well as new system calls to get and set various system parameters. In order to minimize confusion, by convention the new system calls are named with an "x" following the "time"; e.g., adjtime() becomes adjtimex(). The operation of the modified gettimexofday(), settimexofday() and adjtimex() system calls is identical to that of their prototypes, except for the error quantities and certain other side effects, as documented below. By convention, a NULL pointer can be used in place of any argument, in which case the argument is ignored. The synchronization protocol daemon needs to set and adjust the system clock and certain other kernel variables. It needs to read these variables for monitoring purposes as well. The present list of these include a subset of the variables defined previously: long time_precision long time_timeconstant long time_tolerance long time_freq long time_status /* * gettimexofday, settimexofday - get/set date and time */ #include <sys/timex.h> int gettimexofday(tp, tzp, tmaxp, testp) struct timeval *tp; /* system time */ struct timezone *tzp; /* timezone */ struct timeval *tmaxp; /* maximum error */ struct timeval *testp; /* estimated error */ The settimeofday() syntax is identical. Note that a call to settimexofday() automatically results in the system being declared unsynchronized (TIME_BAD return code), since the synchronization condition can only be achieved by the synchronization daemon using an internal or external primary reference source and the adjtimex() system call. /* * adjtimex - adjust system time */ #include <sys/timex.h> int adjtimex(tp, tzp, freq, tc) struct timeval *tp; /* system time */ struct timezone *tzp; /* timezone */ long freq; /* frequency adjustment */ long tc; /* time constant */ /* * getloop, setloop - get/set kernel time variables */ #include <sys/timex.h> int getloop(code, argp) int code; /* operation code */ long *argp; /* argument pointer */ The paticular kernal variables affected by these routines are selected by the operation code. Values presently defined in the header file timex.h are as follows: #define TIME_PREC 1 /* precision (log2(sec)) */ #define TIME_TCON 2 /* time constant (log2(sec) */ #define TIME_FREQ 3 /* frequency tolerance */ #define TIME_FREQ 4 /* frequency offset (scaled) */ #define TIME_STAT 5 /* status (see return codes) */ The getloop() syntax is identical. Comments welcome, but very little support is available: David L. Mills Electrical Engineering Department University of Delaware Newark, DE 19716 302 831 8247 fax 302 831 4316 mills@udel.edu