#
8dca5d44 |
| 22-Jun-2020 |
cheloha <cheloha@openbsd.org> |
timecounting: add gettime(9), getuptime(9)
time_second and time_uptime are used widely in the tree. This is a problem on 32-bit platforms because time_t is 64-bit, so there is a potential split-read whenever they are used at or below IPL_CLOCK.
Here are two replacement interfaces: gettime(9) and getuptime(9). The "get" prefix signifies that they do not read the hardware timecounter, i.e. they are fast and low-res. The lack of a unit (e.g. micro, nano) signifies that they yield a plain time_t.
As an optimization on LP64 platforms we can just return time_second or time_uptime, as a single read is atomic. On 32-bit platforms we need to do the lockless read loop and get the values from the timecounter.
In a subsequent diff these will be substituted for time_second and time_uptime almost everywhere in the kernel.
With input from visa@ and dlg@.
ok kettenis@
|
#
c8f27247 |
| 29-May-2020 |
deraadt <deraadt@openbsd.org> |
dev/rndvar.h no longer has statistical interfaces (removed during various conversion steps). it only contains kernel prototypes for 4 interfaces, all of which legitimately belong in sys/systm.h, which are already included by all enqueue_randomness() users.
|
#
87ba7848 |
| 20-May-2020 |
cheloha <cheloha@openbsd.org> |
timecounting: decide whether to advance offset within tc_windup()
When we resume from a suspend we use the time from the RTC to advance the system offset. This changes the UTC to match what the RTC has given us while increasing the system uptime to account for the time we were suspended.
Currently we decide whether to change to the RTC time in tc_setclock() by comparing the new offset with the th_offset member. This is wrong. th_offset is the *minimum* possible value for the offset, not the "real offset". We need to perform the comparison within tc_windup() after updating th_offset, otherwise we might rewind said offset.
Because we're now doing the comparison within tc_windup() we ought to move naptime into the timehands. This means we now need a way to safely read the naptime to compute the value of CLOCK_UPTIME for userspace. Enter nanoruntime(9); it increases monotonically from boot but does not jump forward after a resume like nanouptime(9).
|
#
a40acd8a |
| 12-Dec-2019 |
cheloha <cheloha@openbsd.org> |
tc_setclock: reintroduce timeout_adjust_ticks() call
Missing piece of tickless timeout revert.
|
#
9098a9c7 |
| 12-Dec-2019 |
cheloha <cheloha@openbsd.org> |
Recommit "tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment"
Reverted with backout of tickless timeouts.
Original commit message:
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits.
ok visa@
|
#
7be7e9e8 |
| 02-Dec-2019 |
cheloha <cheloha@openbsd.org> |
Revert "timeout(9): switch to tickless backend"
It appears to have caused major performance regressions all over the network stack.
Reported by bluhm@
ok deraadt@
|
#
1c733b92 |
| 02-Dec-2019 |
cheloha <cheloha@openbsd.org> |
tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment
We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta in ntp_update_second() to produce timehands.th_adjustment, our net skew. But if you set a low enough adjfreq(2) adjustment you can freeze time. This prevents ntp_update_second() from running again. So even if you then set a sane adjfreq(2) you cannot unfreeze time without rebooting.
If we just reread timecounter.tc_freq_adj every time we recompute timehands.th_scale we avoid this trap. visa@ notes that this is more costly than what we currently do but that the cost itself is negligible.
Intuitively, timecounter.tc_freq_adj is a constant skew and should be handled separately from timehands.th_adjtimedelta, an adjustment that we chip away at very slowly.
tedu@ notes that this problem is sort-of an argument for imposing range limits on adjfreq(2) inputs. He's right, but I think we should still separate the counter adjustment from the adjtime(2) adjustment, with or without range limits.
ok visa@
|
#
4b479330 |
| 26-Nov-2019 |
cheloha <cheloha@openbsd.org> |
timeout(9): switch to tickless backend
Rebase the timeout wheel on the system uptime clock. Timeouts are now set to run at or after an absolute time as returned by nanouptime(9). Timeouts are thus "tickless": they expire at a real time on that clock instead of at a particular value of the global "ticks" variable.
To facilitate this change the timeout struct's .to_time member becomes a timespec. Hashing timeouts into a bucket on the wheel changes slightly: we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of subseconds (.tv_nsec). 7 bits of subseconds means the width of the lowest wheel level is now 2 seconds on all platforms and each bucket in that lowest level corresponds to 1/128 seconds on the uptime clock. These values were chosen to closely align with the current 100hz hardclock(9) typical on almost all of our platforms. At 100hz a bucket is currently ~1/100 seconds wide on the lowest level and the lowest level itself is ~2.56 seconds wide. Not a huge change, but a change nonetheless.
Because a bucket no longer corresponds to a single tick more than one bucket may be dumped during an average timeout_hardclock_update() call. On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket, but you are doing extra work in softclock() to reschedule timeouts that aren't due yet.
To avoid changing current behavior all timeout_add*(9) interfaces convert their timeout interval into ticks, compute an equivalent timespec interval, and then add that interval to the timestamp of the most recent timeout_hardclock_update() call to determine an absolute deadline. So all current timeouts still "use" ticks, but the ticks are faked in the timeout layer.
A new interface, timeout_at_ts(9), is introduced here to bypass this backwardly compatible behavior. It will be used in subsequent diffs to add absolute timeout support for userland and to clean up some of the messier parts of kernel timekeeping, especially at the syscall layer.
Because timeouts are based against the uptime clock they are subject to NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy adjfreq(2) adjustment set this will not change the expiration behavior of your timeouts.
Tons of design feedback from mpi@, visa@, guenther@, and kettenis@. Additional amd64 testing from anton@ and visa@. Octeon testing from visa@. macppc testing from me.
Positive feedback from deraadt@, ok visa@
|
#
875f2e32 |
| 26-Oct-2019 |
cheloha <cheloha@openbsd.org> |
clock_getres(2): actually return the resolution of the given clock
Currently we return (1000000000 / hz) from clock_getres(2) as the resolution for every clock. This is often untrue.
For CPUTIME clocks, if we have a separate statclock interrupt the resolution is (1000000000 / stathz). Otherwise it is as we currently claim: (1000000000 / hz).
For the REALTIME/MONOTONIC/UPTIME/BOOTTIME clocks the resolution is that of the active timecounter. During tc_init() we can compute the precision of a timecounter by examining its tc_counter_mask and store it for lookup later in a new member, tc_precision. The resolution of a clock backed by a timecounter "tc" is then
tc.tc_precision * (2^64 / tc.tc_frequency)
fractional seconds.
While here we can clean up sys_clock_getres() a bit.
Standards input from guenther@. Lots of input, feedback from kettenis@.
ok kettenis@
|
#
02f434f1 |
| 22-Oct-2019 |
cheloha <cheloha@openbsd.org> |
nanoboottime(9): add and document new interface
Wanted for upcoming process accounting changes, maybe useful elsewhere.
ok bluhm@ millert@
|
#
75b45b05 |
| 03-Jun-2019 |
cheloha <cheloha@openbsd.org> |
Switch from bintime_add() et al. to bintimeadd(9).
Basically just make all the bintime routines look and behave more like the timeradd(3) macros.
Switch to three-argument forms for structure math, introduce and use bintimecmp(9), and rename the structure conversion routines to resemble e.g. TIMEVAL_TO_TIMESPEC(3).
Document all of this in a new bintimeadd.9 page.
Code input from mpi@, manpage input from schwarze@.
code ok mpi@, docs ok schwarze@, docs probably still ok jmc@
|
#
8334e679 |
| 22-May-2019 |
cheloha <cheloha@openbsd.org> |
SLIST-ify the timecounter list.
Call it "tc_list" instead of "timecounters", which is too similar to the variable "timecounter" for my taste.
ok mpi@ visa@
|
#
08e05d41 |
| 20-May-2019 |
cheloha <cheloha@openbsd.org> |
kern.timecounter.choices: Don't offer the dummy counter as an option.
The dummy counter is a stopgap during boot. It is not useful after a real timecounter is attached and started and there is no reason to return to using it.
So don't even offer it to the admin. This is easy: never add it to the timecounter list. It will effectively cease to exist after the first real timecounter is activated in tc_init().
In principle this means that we can have an empty timecounter list so we need to check for that case in sysctl_tc_choice().
"I don't mind" mpi@, ok visa@
|
#
74106511 |
| 10-May-2019 |
cheloha <cheloha@openbsd.org> |
Reduce the number of timehands to just two.
Reduces the worst-case error for time values retrieved via the microtime(9) functions from 10 ticks to 2 ticks. Being interrupted for over a tick is unlikely but possible.
While here use C99 initializers.
From FreeBSD r303383.
ok mpi@
|
#
e98df54a |
| 30-Apr-2019 |
cheloha <cheloha@openbsd.org> |
tc_setclock: always call tc_windup() before leaving windup_mtx.
We ought to conform to the windup_mtx protocol and call tc_windup() even if we aren't changing the system uptime.
|
#
af3eeb45 |
| 25-Mar-2019 |
cheloha <cheloha@openbsd.org> |
MP-safe timecounting: new rwlock: tc_lock
tc_lock allows adjfreq(2) and the kern.timecounter.hardware sysctl(2) to read/write the active timecounter pointer and the .tc_adj_freq member of the active timecounter safely. This eliminates any possibility of a torn read/write for the .tc_adj_freq member when we drop the KERNEL_LOCK from the timecounting layer. It also ensures the active timecounter does not change in the midst of an adjfreq(2) call.
Because these are not high-traffic paths, we can get away with using tc_lock in write-mode to ensure combination read/write adjtime(2) calls are relatively atomic (a) to other writer adjtime(2) calls, and (b) to settimeofday(2)/clock_settime(2) calls, which cancel ongoing adjtime(2) adjustment.
When the KERNEL_LOCK is dropped, an unprivileged user will be able to create some tc_lock contention via adjfreq(2); it is very unlikely to ever be a problem. If it ever is actually a problem a lockless read could be added to address it.
While here, reorganize sys_adjfreq()/sys_adjtime() to minimize code under the lock. Also while here, make tc_adjfreq() void, as it cannot fail under any circumstance. Also also while here, annotate various globals/struct members with lock ordering details.
With lots of input from mpi@ and visa@.
ok visa@
|
#
c54148e4 |
| 22-Mar-2019 |
cheloha <cheloha@openbsd.org> |
Move adjtimedelta into the timehands.
adjtimedelta is 64-bit and thus can't be read/written atomically on all architectures. Because it can be modified from tc_windup() and ntp_update_second() we need a way to ensure safe reads/writes for adjtime(2) callers. One solution is to move it into the timehands and adopt the lockless read protocol we now use for the system boot time and uptime.
So make new_adjtimedelta an argument to tc_windup() and add a lockless read loop to tc_adjtime(). With adjtimedelta stored in the timehands we can now simply pass a timehands pointer to ntp_update_second(). This makes ntp_update_second() safer as we're using the timehands' timecounter pointer instead of the mutable global timecounter pointer.
Lots of input from mpi@ and visa@.
ok visa@
|
#
ceab5aef |
| 22-Mar-2019 |
cheloha <cheloha@openbsd.org> |
Rename "timecounter_mtx" to "windup_mtx".
This will make upcoming MP-related diffs smaller and should make the code in kern_tc.c easier to read in general. "windup_mtx" is also a better mnemonic: always call tc_windup() before leaving windup_mtx.
|
#
7c21e1f3 |
| 17-Mar-2019 |
cheloha <cheloha@openbsd.org> |
Change boot time/offset within tc_windup().
We need to perform the actual modification of the boot offset and the time-of-boot within the "safe zone" in tc_windup() where the timehands' generation is zero to conform to the timehands lockless read protocol.
Based on FreeBSD r303387.
Discussed with mpi@ and visa@.
ok visa@
|
#
3c2e3f4b |
| 10-Mar-2019 |
cheloha <cheloha@openbsd.org> |
Move adjtimedelta from kern_time.c to kern_tc.c.
This will simplify upcoming MP-safety diffs for the timecounting layer.
adjtimedelta is now accessed nowhere outside of kern_tc.c, so we can remove its extern declaration from kernel.h. Zeroing adjtimedelta within timecounter_mtx before we jump the real-time clock is also a bit safer than what we do now, as we are not racing a simultaneous tc_windup() call from hardclock(), which itself can modify adjtimedelta via ntp_update_second().
Discussed with visa@ and mpi@.
ok visa@
|
#
827d5adb |
| 09-Mar-2019 |
cheloha <cheloha@openbsd.org> |
tc_windup: read active timecounter once at function start.
tc_windup() is not necessarily called with KERNEL_LOCK, so it is possible for the timecounter pointer to change in the midst of the call via the kern.timecounter.hardware sysctl(2). Reading it once and using that local copy ensures we're referring to the same timecounter consistently.
Apparently the compiler can optimize this out... somehow... so there may be room for improvement.
Idea from visa@. With input from visa@, mpi@, cjeker@, and guenther@.
ok visa@ mpi@
|
#
e12a049b |
| 31-Jan-2019 |
cheloha <cheloha@openbsd.org> |
tc_setclock: Don't rewind the system uptime during resume/unhibernate.
When we come back from suspend/hibernate the BIOS/firmware/whatever can hand us *any* TOD, so we need to check that the given TOD doesn't set our boot offset backwards, breaking the monotonicity of e.g. CLOCK_MONOTONIC. This is trivial to do from the BIOS on most PCs before unhibernating. There might be other ways it can happen, accidentally or otherwise.
This is a bit messy but it can be made prettier later with a "bintimecmp" macro or something like that.
Problem confirmed by jmatthew@.
"you are very likely right" deraadt@
|
#
1d8de610 |
| 20-Jan-2019 |
cheloha <cheloha@openbsd.org> |
Serialize tc_windup() calls and modification of some timehands members.
If a user thread from e.g. clock_settime(2) is in the midst of changing the boottime or calling tc_windup() when it is interrupted by hardclock(9), the timehands could be left in a damaged state.
So protect tc_windup() calls with a mutex, timecounter_mtx. hardclock(9) merely attempts to enter the mutex instead of spinning because it cannot afford to wait around. In practice hardclock(9) will skip tc_windup() very rarely, and when it does skip there aren't any negative effects because the skip indicates that a user thread is already calling, or about to call, tc_windup() anyway.
Based on FreeBSD r303387 and NetBSD sys/kern/kern_tc.c,v1.30
Discussed with mpi@ and visa@. Tons of nice technical detail about lockless reads from visa@.
OK visa@
|
#
fa5a0c50 |
| 19-Jan-2019 |
cheloha <cheloha@openbsd.org> |
Move boottime into the timehands.
To protect the timehands we first need to protect the basis for all UTC time in the kernel: the boottime.
Because the boottime can be changed at any time it needs to be versioned along with the other members of the timehands to enable safe lockless reads when using it for anything. So the global boottime timespec goes away and the static boottimebin becomes a member of the timehands. Instead of reading the global boottime you use one of two interfaces: binboottime(9) or microboottime(9). nanoboottime(9) can trivially be added later, though there are no consumers for it at the moment.
This introduces one small change in behavior. We used to advance the reported boottime just before launching kernel threads from main(). This makes it look to userland like we "booted" moments before those threads were launched. Because there is no longer a boottime global we can no longer trivially do this from main(), so the boottime we report to userspace via e.g. kern.boottime will now reflect whatever the time was when we bootstrapped the timehands via inittodr(9). This is usually no more than a minute before the kernel threads are launched from main(). The prior behavior can be restored by adding a new interface to the timecounter layer in a future commit.
Based on FreeBSD r303387.
Discussed with mpi@ and visa@.
ok visa@
|
#
a701e5df |
| 18-Sep-2018 |
bluhm <bluhm@openbsd.org> |
Updating time counters without memory barriers is wrong. Put membar_producer() into tc_windup() and membar_consumer() into the uptime functions. They order the visibility of the time and generation number updates. This is a combination of what NetBSD and FreeBSD do. OK kettenis@
|