History log of /openbsd-src/sys/kern/kern_tc.c (Results 26 – 50 of 83)
Revision Date Author Comments
# 8dca5d44 22-Jun-2020 cheloha <cheloha@openbsd.org>

timecounting: add gettime(9), getuptime(9)

time_second and time_uptime are used widely in the tree. This is a
problem on 32-bit platforms because time_t is 64-bit, so there is a
potential split-read whenever they are used at or below IPL_CLOCK.

Here are two replacement interfaces: gettime(9) and getuptime(9).
The "get" prefix signifies that they do not read the hardware
timecounter, i.e. they are fast and low-res. The lack of a unit
(e.g. micro, nano) signifies that they yield a plain time_t.

As an optimization on LP64 platforms we can just return time_second or
time_uptime, as a single read is atomic. On 32-bit platforms we need
to do the lockless read loop and get the values from the timecounter.
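The lockless read loop mentioned above can be sketched roughly as follows. This is a hypothetical userland illustration, not the committed kern_tc.c code; names like th_generation are stand-ins:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the split-read-safe pattern: on LP64 a
 * single load of a 64-bit time_t is atomic, but on 32-bit platforms
 * the reader must retry until it observes the same nonzero
 * generation before and after loading the value. A generation of 0
 * means an update is in progress.
 */
static volatile unsigned int th_generation = 2;
static volatile int64_t th_second;

static int64_t
gettime_sketch(void)
{
	unsigned int gen;
	int64_t now;

	do {
		gen = th_generation;
		now = th_second;	/* may be a split read on 32-bit */
	} while (gen == 0 || gen != th_generation);
	return now;
}
```

In the kernel the loop also needs memory barriers between the generation and value loads; they are omitted here for brevity.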

In a subsequent diff these will be substituted for time_second and
time_uptime almost everywhere in the kernel.

With input from visa@ and dlg@.

ok kettenis@



# c8f27247 29-May-2020 deraadt <deraadt@openbsd.org>

dev/rndvar.h no longer has statistical interfaces (removed during various
conversion steps). it only contains kernel prototypes for 4 interfaces,
all of which legitimately belong in sys/systm.h, which are already included
by all enqueue_randomness() users.



# 87ba7848 20-May-2020 cheloha <cheloha@openbsd.org>

timecounting: decide whether to advance offset within tc_windup()

When we resume from a suspend we use the time from the RTC to advance
the system offset. This changes the UTC to match what the RTC has given
us while increasing the system uptime to account for the time we were
suspended.

Currently we decide whether to change to the RTC time in tc_setclock()
by comparing the new offset with the th_offset member. This is wrong.
th_offset is the *minimum* possible value for the offset, not the "real
offset". We need to perform the comparison within tc_windup() after
updating th_offset, otherwise we might rewind said offset.

Because we're now doing the comparison within tc_windup() we ought to
move naptime into the timehands. This means we now need a way to safely
read the naptime to compute the value of CLOCK_UPTIME for userspace.
Enter nanoruntime(9); it increases monotonically from boot but does not
jump forward after a resume like nanouptime(9).
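The relationship between the two clocks can be sketched as follows; plain integers stand in for the bintime arithmetic in kern_tc.c, and the function name mirrors the interface introduced above:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: after a resume, nanouptime(9) jumps forward
 * by the time spent suspended (the naptime), while nanoruntime(9)
 * keeps counting only time spent awake.
 */
static int64_t
nanoruntime_sketch(int64_t uptime_ns, int64_t naptime_ns)
{
	return uptime_ns - naptime_ns;	/* monotonic, no resume jump */
}
```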



# a40acd8a 12-Dec-2019 cheloha <cheloha@openbsd.org>

tc_setclock: reintroduce timeout_adjust_ticks() call

Missing piece of tickless timeout revert.


# 9098a9c7 12-Dec-2019 cheloha <cheloha@openbsd.org>

Recommit "tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment"

Reverted with backout of tickless timeouts.

Original commit message:

We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.

If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.

Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.

tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.

ok visa@



# 7be7e9e8 02-Dec-2019 cheloha <cheloha@openbsd.org>

Revert "timeout(9): switch to tickless backend"

It appears to have caused major performance regressions all over the
network stack.

Reported by bluhm@

ok deraadt@


# 1c733b92 02-Dec-2019 cheloha <cheloha@openbsd.org>

tc_windup: separate timecounter.tc_freq_adj from timehands.th_adjustment

We currently mix timecounter.tc_freq_adj and timehands.th_adjtimedelta
in ntp_update_second() to produce timehands.th_adjustment, our net skew.
But if you set a low enough adjfreq(2) adjustment you can freeze time.
This prevents ntp_update_second() from running again. So even if you
then set a sane adjfreq(2) you cannot unfreeze time without rebooting.

If we just reread timecounter.tc_freq_adj every time we recompute
timehands.th_scale we avoid this trap. visa@ notes that this is
more costly than what we currently do but that the cost itself is
negligible.

Intuitively, timecounter.tc_freq_adj is a constant skew and should be
handled separately from timehands.th_adjtimedelta, an adjustment that
we chip away at very slowly.

tedu@ notes that this problem is sort-of an argument for imposing range
limits on adjfreq(2) inputs. He's right, but I think we should still
separate the counter adjustment from the adjtime(2) adjustment, with
or without range limits.

ok visa@



# 4b479330 26-Nov-2019 cheloha <cheloha@openbsd.org>

timeout(9): switch to tickless backend

Rebase the timeout wheel on the system uptime clock. Timeouts are now
set to run at or after an absolute time as returned by nanouptime(9).
Timeouts are thus "tickless": they expire at a real time on that clock
instead of at a particular value of the global "ticks" variable.

To facilitate this change the timeout struct's .to_time member becomes a
timespec. Hashing timeouts into a bucket on the wheel changes slightly:
we build a 32-bit hash with 25 bits of seconds (.tv_sec) and 7 bits of
subseconds (.tv_nsec). 7 bits of subseconds means the width of the
lowest wheel level is now 2 seconds on all platforms and each bucket in
that lowest level corresponds to 1/128 seconds on the uptime clock.
These values were chosen to closely align with the current 100hz
hardclock(9) typical on almost all of our platforms. At 100hz a bucket
is currently ~1/100 seconds wide on the lowest level and the lowest
level itself is ~2.56 seconds wide. Not a huge change, but a change
nonetheless.
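The bucket hash described above can be sketched like this. The function name and exact bit layout are illustrative, not the committed code:

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical sketch of the 32-bit hash: the low 25 bits of the
 * seconds plus the top 7 bits of the subsecond part, so each
 * lowest-level bucket spans 1/128 s and the level spans 2 s.
 */
static uint32_t
timeout_hash_sketch(const struct timespec *ts)
{
	uint32_t sec = (uint32_t)ts->tv_sec & 0x01ffffffU;	/* 25 bits */
	uint32_t sub = (uint32_t)((uint64_t)ts->tv_nsec * 128 /
	    1000000000ULL);					/*  7 bits */

	return (sec << 7) | sub;
}
```

With this layout, consecutive 1/128-second intervals map to consecutive buckets, which is what lets the wheel dump more than one bucket per hardclock(9) tick.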

Because a bucket no longer corresponds to a single tick more than one
bucket may be dumped during an average timeout_hardclock_update() call.
On 100hz platforms you now dump ~2 buckets. On 64hz machines (sh) you
dump ~4 buckets. On 1024hz machines (alpha) you dump only 1 bucket,
but you are doing extra work in softclock() to reschedule timeouts
that aren't due yet.

To avoid changing current behavior all timeout_add*(9) interfaces
convert their timeout interval into ticks, compute an equivalent
timespec interval, and then add that interval to the timestamp of
the most recent timeout_hardclock_update() call to determine an
absolute deadline. So all current timeouts still "use" ticks,
but the ticks are faked in the timeout layer.

A new interface, timeout_at_ts(9), is introduced here to bypass this
backwardly compatible behavior. It will be used in subsequent diffs
to add absolute timeout support for userland and to clean up some of
the messier parts of kernel timekeeping, especially at the syscall
layer.

Because timeouts are based against the uptime clock they are subject to
NTP adjustment via adjtime(2) and adjfreq(2). Unless you have a crazy
adjfreq(2) adjustment set this will not change the expiration behavior
of your timeouts.

Tons of design feedback from mpi@, visa@, guenther@, and kettenis@.
Additional amd64 testing from anton@ and visa@. Octeon testing from visa@.
macppc testing from me.

Positive feedback from deraadt@, ok visa@



# 875f2e32 26-Oct-2019 cheloha <cheloha@openbsd.org>

clock_getres(2): actually return the resolution of the given clock

Currently we return (1000000000 / hz) from clock_getres(2) as the
resolution for every clock. This is often untrue.

For CPUTIME clocks, if we have a separate statclock interrupt the
resolution is (1000000000 / stathz). Otherwise it is as we currently
claim: (1000000000 / hz).

For the REALTIME/MONOTONIC/UPTIME/BOOTTIME clocks the resolution is
that of the active timecounter. During tc_init() we can compute the
precision of a timecounter by examining its tc_counter_mask and store
it for lookup later in a new member, tc_precision. The resolution of
a clock backed by a timecounter "tc" is then

tc.tc_precision * (2^64 / tc.tc_frequency)

fractional seconds.
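The computation above can be sketched as follows, converting the binary-fraction result to nanoseconds for readability. This is a hypothetical illustration (it uses a compiler-specific 128-bit type and a simplified scale computation), not the committed code:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the scale factor (2^64 / frequency) converts
 * counter units into 64-bit binary fractions of a second, and
 * multiplying by the counter's precision gives the clock resolution
 * as a binary fraction. 2^64 does not fit in a uint64_t, so the
 * scale is approximated as (UINT64_MAX / frequency) + 1.
 */
static uint64_t
tc_res_ns_sketch(uint64_t precision, uint64_t frequency)
{
	uint64_t scale = (UINT64_MAX / frequency) + 1;

	/* resolution in ns = precision * scale * 10^9 / 2^64 */
	return (uint64_t)(((unsigned __int128)precision * scale *
	    1000000000ULL) >> 64);
}
```

For example, a 1 GHz counter with single-count precision resolves to about 1 ns, while a 1 MHz counter resolves to about 1000 ns.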

While here we can clean up sys_clock_getres() a bit.

Standards input from guenther@. Lots of input, feedback from
kettenis@.

ok kettenis@



# 02f434f1 22-Oct-2019 cheloha <cheloha@openbsd.org>

nanoboottime(9): add and document new interface

Wanted for upcoming process accounting changes, maybe useful elsewhere.

ok bluhm@ millert@


# 75b45b05 03-Jun-2019 cheloha <cheloha@openbsd.org>

Switch from bintime_add() et al. to bintimeadd(9).

Basically just make all the bintime routines look and behave more like
the timeradd(3) macros.

Switch to three-argument forms for structure math, introduce and use
bintimecmp(9), and rename the structure conversion routines to resemble
e.g. TIMEVAL_TO_TIMESPEC(3).
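The three-argument form can be sketched like this. The struct layout matches <sys/time.h>, but the function bodies here are illustrative (the real bintimecmp(9) is a three-argument macro taking a comparison operator):

```c
#include <stdint.h>
#include <time.h>

/* 64.64 fixed-point time: whole seconds plus a binary fraction. */
struct bintime {
	time_t sec;
	uint64_t frac;
};

/* Three-argument add: res = a + b, with carry out of the fraction. */
static void
bintimeadd(const struct bintime *a, const struct bintime *b,
    struct bintime *res)
{
	res->sec = a->sec + b->sec;
	res->frac = a->frac + b->frac;
	if (res->frac < a->frac)	/* fraction wrapped: carry */
		res->sec++;
}

/* Illustrative "a < b" comparison in the spirit of bintimecmp(9). */
static int
bintimecmp_lt(const struct bintime *a, const struct bintime *b)
{
	return a->sec < b->sec ||
	    (a->sec == b->sec && a->frac < b->frac);
}
```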

Document all of this in a new bintimeadd.9 page.

Code input from mpi@, manpage input from schwarze@.

code ok mpi@, docs ok schwarze@, docs probably still ok jmc@



# 8334e679 22-May-2019 cheloha <cheloha@openbsd.org>

SLIST-ify the timecounter list.

Call it "tc_list" instead of "timecounters", which is too similar to
the variable "timecounter" for my taste.

ok mpi@ visa@


# 08e05d41 20-May-2019 cheloha <cheloha@openbsd.org>

kern.timecounter.choices: Don't offer the dummy counter as an option.

The dummy counter is a stopgap during boot. It is not useful after a
real timecounter is attached and started and there is no reason to return
to using it.

So don't even offer it to the admin. This is easy: never add it to the
timecounter list. It will effectively cease to exist after the first real
timecounter is activated in tc_init().

In principle this means that we can have an empty timecounter list so we
need to check for that case in sysctl_tc_choice().

"I don't mind" mpi@, ok visa@



# 74106511 10-May-2019 cheloha <cheloha@openbsd.org>

Reduce the number of timehands from ten to just two.

Reduces the worst-case error for time values retrieved via the
microtime(9) functions from 10 ticks to 2 ticks. Being interrupted
for over a tick is unlikely but possible.

While here use C99 initializers.

From FreeBSD r303383.

ok mpi@



# e98df54a 30-Apr-2019 cheloha <cheloha@openbsd.org>

tc_setclock: always call tc_windup() before leaving windup_mtx.

We ought to conform to the windup_mtx protocol and call tc_windup() even
if we aren't changing the system uptime.


# af3eeb45 25-Mar-2019 cheloha <cheloha@openbsd.org>

MP-safe timecounting: new rwlock: tc_lock

tc_lock allows adjfreq(2) and the kern.timecounter.hardware sysctl(2)
to read/write the active timecounter pointer and the .tc_adj_freq
member of the active timecounter safely. This eliminates any possibility
of a torn read/write for the .tc_adj_freq member when we drop the
KERNEL_LOCK from the timecounting layer. It also ensures the active
timecounter does not change in the midst of an adjfreq(2) call.

Because these are not high-traffic paths, we can get away with using
tc_lock in write-mode to ensure combination read/write adjtime(2) calls
are relatively atomic (a) to other writer adjtime(2) calls, and (b) to
settimeofday(2)/clock_settime(2) calls, which cancel ongoing adjtime(2)
adjustment.

When the KERNEL_LOCK is dropped, an unprivileged user will be able to
create some tc_lock contention via adjfreq(2); it is very unlikely to
ever be a problem. If it ever is actually a problem a lockless read
could be added to address it.

While here, reorganize sys_adjfreq()/sys_adjtime() to minimize code
under the lock. Also while here, make tc_adjfreq() void, as it cannot
fail under any circumstance. Also also while here, annotate various
globals/struct members with lock ordering details.

With lots of input from mpi@ and visa@.

ok visa@



# c54148e4 22-Mar-2019 cheloha <cheloha@openbsd.org>

Move adjtimedelta into the timehands.

adjtimedelta is 64-bit and thus can't be read/written atomically on all
architectures. Because it can be modified from tc_windup() and
ntp_update_second() we need a way to ensure safe reads/writes for
adjtime(2) callers. One solution is to move it into the timehands and
adopt the lockless read protocol we now use for the system boot time and
uptime.

So make new_adjtimedelta an argument to tc_windup() and add a lockless
read loop to tc_adjtime(). With adjtimedelta stored in the timehands
we can now simply pass a timehands pointer to ntp_update_second(). This
makes ntp_update_second() safer as we're using the timehands' timecounter
pointer instead of the mutable global timecounter pointer.

Lots of input from mpi@ and visa@.

ok visa@



# ceab5aef 22-Mar-2019 cheloha <cheloha@openbsd.org>

Rename "timecounter_mtx" to "windup_mtx".

This will make upcoming MP-related diffs smaller and should make the code
in kern_tc.c easier to read in general. "windup_mtx" is also a better
mnemonic: always call tc_windup() before leaving windup_mtx.



# 7c21e1f3 17-Mar-2019 cheloha <cheloha@openbsd.org>

Change boot time/offset within tc_windup().

We need to perform the actual modification of the boot offset and the
time-of-boot within the "safe zone" in tc_windup() where the timehands'
generation is zero to conform to the timehands lockless read protocol.

Based on FreeBSD r303387.

Discussed with mpi@ and visa@.

ok visa@



# 3c2e3f4b 10-Mar-2019 cheloha <cheloha@openbsd.org>

Move adjtimedelta from kern_time.c to kern_tc.c.

This will simplify upcoming MP-safety diffs for the timecounting layer.

adjtimedelta is now accessed nowhere outside of kern_tc.c, so we can
remove its extern declaration from kernel.h. Zeroing adjtimedelta
within timecounter_mtx before we jump the real-time clock is also a
bit safer than what we do now, as we are not racing a simultaneous
tc_windup() call from hardclock(), which itself can modify adjtimedelta
via ntp_update_second().

Discussed with visa@ and mpi@.

ok visa@



# 827d5adb 09-Mar-2019 cheloha <cheloha@openbsd.org>

tc_windup: read active timecounter once at function start.

tc_windup() is not necessarily called with KERNEL_LOCK, so it is possible
for the timecounter pointer to change in the midst of the call via the
kern.timecounter.hardware sysctl(2). Reading it once and using that local
copy ensures we're referring to the same timecounter consistently.

Apparently the compiler can optimize this out... somehow... so there may
be room for improvement.

Idea from visa@. With input from visa@, mpi@, cjeker@, and guenther@.

ok visa@ mpi@



# e12a049b 31-Jan-2019 cheloha <cheloha@openbsd.org>

tc_setclock: Don't rewind the system uptime during resume/unhibernate.

When we come back from suspend/hibernate the BIOS/firmware/whatever can
hand us *any* TOD, so we need to check that the given TOD doesn't set our
boot offset backwards, breaking the monotonicity of e.g. CLOCK_MONOTONIC.
This is trivial to do from the BIOS on most PCs before unhibernating.
There might be other ways it can happen, accidentally or otherwise.

This is a bit messy but it can be made prettier later with a "bintimecmp"
macro or something like that.

Problem confirmed by jmatthew@.

"you are very likely right" deraadt@



# 1d8de610 20-Jan-2019 cheloha <cheloha@openbsd.org>

Serialize tc_windup() calls and modification of some timehands members.

If a user thread from e.g. clock_settime(2) is in the midst of changing
the boottime or calling tc_windup() when it is interrupted by hardclock(9),
the timehands could be left in a damaged state.

So protect tc_windup() calls with a mutex, timecounter_mtx. hardclock(9)
merely attempts to enter the mutex instead of spinning because it cannot
afford to wait around. In practice hardclock(9) will skip tc_windup() very
rarely, and when it does skip there aren't any negative effects because the
skip indicates that a user thread is already calling, or about to call,
tc_windup() anyway.
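The try-lock arrangement can be sketched like this, using pthreads in place of the kernel mutex. Names and structure are illustrative, not the committed code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t windup_mtx = PTHREAD_MUTEX_INITIALIZER;
static int windup_calls;

/* Stand-in for tc_windup(): always called with the mutex held. */
static void
tc_windup_sketch(void)
{
	windup_calls++;
}

/* settime path: must wind up, so it waits for the mutex. */
static void
settime_path(void)
{
	pthread_mutex_lock(&windup_mtx);
	tc_windup_sketch();
	pthread_mutex_unlock(&windup_mtx);
}

/*
 * hardclock path: cannot afford to spin at interrupt level, so it
 * only tries the lock. Skipping is harmless because a failed try
 * means another thread is winding up, or about to wind up, anyway.
 */
static bool
hardclock_path(void)
{
	if (pthread_mutex_trylock(&windup_mtx) != 0)
		return false;	/* contended: skip this windup */
	tc_windup_sketch();
	pthread_mutex_unlock(&windup_mtx);
	return true;
}
```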

Based on FreeBSD r303387 and NetBSD sys/kern/kern_tc.c,v1.30

Discussed with mpi@ and visa@. Tons of nice technical detail about
lockless reads from visa@.

OK visa@



# fa5a0c50 19-Jan-2019 cheloha <cheloha@openbsd.org>

Move boottime into the timehands.

To protect the timehands we first need to protect the basis for all UTC
time in the kernel: the boottime.

Because the boottime can be changed at any time it needs to be versioned
along with the other members of the timehands to enable safe lockless reads
when using it for anything. So the global boottime timespec goes away and
the static boottimebin becomes a member of the timehands. Instead of reading
the global boottime you use one of two interfaces: binboottime(9) or
microboottime(9). nanoboottime(9) can trivially be added later, though there
are no consumers for it at the moment.

This introduces one small change in behavior. We used to advance the
reported boottime just before launching kernel threads from main().
This makes it look to userland like we "booted" moments before those
threads were launched. Because there is no longer a boottime global we
can no longer trivially do this from main(), so the boottime we report
to userspace via e.g. kern.boottime will now reflect whatever the time
was when we bootstrapped the timehands via inittodr(9). This is usually
no more than a minute before the kernel threads are launched from main().
The prior behavior can be restored by adding a new interface to the
timecounter layer in a future commit.

Based on FreeBSD r303387.

Discussed with mpi@ and visa@.

ok visa@



# a701e5df 18-Sep-2018 bluhm <bluhm@openbsd.org>

Updating time counters without memory barriers is wrong. Put
membar_producer() into tc_windup() and membar_consumer() into the
uptime functions. They order the visibility of the time and
generation number updates.
This is a combination of what NetBSD and FreeBSD do.
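The ordering described above can be sketched in userland with C11 fences standing in for the kernel's membar_producer()/membar_consumer(). This is a hypothetical illustration; the real tc_windup() increments a per-timehands generation rather than toggling a single global:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic unsigned int gen = 2;	/* 0 = update in progress */
static uint64_t timeval_data;

/*
 * Writer (tc_windup() analogue): the release fences play the role
 * of membar_producer(), making the data store visible before the
 * final generation store.
 */
static void
windup_sketch(uint64_t newval)
{
	atomic_store_explicit(&gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* membar_producer() */
	timeval_data = newval;
	atomic_thread_fence(memory_order_release);	/* membar_producer() */
	atomic_store_explicit(&gen, 2, memory_order_relaxed);
}

/*
 * Reader (uptime-function analogue): the acquire fences play the
 * role of membar_consumer(), ordering the generation loads around
 * the data load; retry on a zero or changed generation.
 */
static uint64_t
read_sketch(void)
{
	unsigned int g;
	uint64_t v;

	do {
		g = atomic_load_explicit(&gen, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* membar_consumer() */
		v = timeval_data;
		atomic_thread_fence(memory_order_acquire);	/* membar_consumer() */
	} while (g == 0 ||
	    g != atomic_load_explicit(&gen, memory_order_relaxed));
	return v;
}
```

Without the fences, a weakly ordered CPU could let the reader observe the new generation but a stale time value, which is exactly the bug this commit fixes.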
OK kettenis@


