History log of /dflybsd-src/sys/platform/vkernel64/include/pmap_inval.h (Results 1 – 4 of 4)
Revision Date Author Comments
# 95270b7e 01-Feb-2017 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Many fixes for vkernel support, plus a few main kernel fixes

REAL KERNEL

* The big enchilada is that the main kernel's thread switch code has
a small timing window where it clears the PM_ACTIVE bit for the cpu
while switching between two threads. However, it *ALSO* checks and
avoids loading the %cr3 if the two threads have the same pmap.

This results in a situation where an invalidation on the pmap issued from
another cpu may not be visible to the cpu doing the switch, and yet the
cpu doing the switch also decides not to reload %cr3 and so does not
invalidate the TLB either. The result is a stale TLB and bad things
happen.

For now just unconditionally load %cr3 until I can come up with code
to handle the case.

This bug is very difficult to reproduce on a normal system, it requires
a multi-threaded program doing nasty things (munmap, etc) on one cpu
while another thread is switching to a third thread on some other cpu.
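
A minimal sketch of the interim workaround, using stand-in C names
(load_cr3(), switch_user_pmap()) rather than the real swtch.s assembly;
this is illustrative only, not the committed code:

    #include <stdint.h>

    /* Stand-in for the %cr3 reload the real kernel does in swtch.s. */
    static inline void
    load_cr3(uint64_t cr3)
    {
            __asm__ __volatile__("movq %0,%%cr3" : : "r" (cr3) : "memory");
    }

    /*
     * Interim fix: reload %cr3 unconditionally on a thread switch.  The
     * old optimization skipped the reload when both threads shared a
     * pmap, which combined with the brief PM_ACTIVE-clear window could
     * leave a stale TLB after a remote invalidation.
     */
    static inline void
    switch_user_pmap(uint64_t old_cr3, uint64_t new_cr3)
    {
            (void)old_cr3;     /* previously: skip if old_cr3 == new_cr3 */
            load_cr3(new_cr3); /* always reload; flushes non-global TLB */
    }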

* KNOTE after handling the vkernel trap in postsig() instead of before.

* Change the kernel's pmap_inval_smp() code to take a 64-bit npgs
argument instead of a 32-bit npgs argument. This fixes situations
that crop up when a process uses more than 16TB of address space.
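
Roughly, the change looks like the following; the exact parameter list and
the use of vm_pindex_t as the 64-bit type are assumptions for illustration,
not copied from the tree:

    #include <stdint.h>

    typedef uint64_t pt_entry_t;    /* stand-in types for the sketch */
    typedef uint64_t vm_offset_t;
    typedef uint64_t vm_pindex_t;
    struct pmap;

    /*
     * Before (sketch): a 32-bit page count wraps once a single
     * invalidation range covers 2^32 pages (~16TB with 4K pages).
     *
     * pt_entry_t pmap_inval_smp(struct pmap *pmap, vm_offset_t va,
     *                           int npgs, pt_entry_t *ptep,
     *                           pt_entry_t npte);
     */

    /* After (sketch): a 64-bit page count covers the full range. */
    pt_entry_t pmap_inval_smp(struct pmap *pmap, vm_offset_t va,
                              vm_pindex_t npgs, pt_entry_t *ptep,
                              pt_entry_t npte);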

* Add an lfence to the pmap invalidation code that I think might be
needed.

* Handle some wrap/overflow cases in pmap_scan() related to the use of
large address spaces.

* Fix an unnecessary invltlb in pmap_clearbit() for unmanaged PTEs.

* Test PG_RW after locking the pv_entry to handle potential races.

* Add bio_crc to struct bio. This field is only used for debugging for
now but may come in useful later.

* Add some global debug variables in the pmap_inval_smp() and related
paths. Refactor the npgs handling.

* Load the tsc_target field after waiting for completion of the previous
invalidation op instead of before. Also add a conservative mfence()
in the invalidation path before loading the info fields.

* Remove the global pmap_inval_bulk_count counter.

* Adjust swtch.s to always reload the user process %cr3, with an
explanation. FIXME LATER!

* Add some test code to vm/swap_pager.c which double-checks that the page
being paged out does not get corrupted during the operation. This code
is #if 0'd.

* We must hold an object lock around the swp_pager_meta_ctl() call in
swp_pager_async_iodone(). I think.

* Reorder when PG_SWAPINPROG is cleared. Finish the I/O before clearing
the bit.

* Change the vm_map_growstack() API to pass a vm_map in instead of
curproc.

* Use atomic ops for vm_object->generation counts, since objects can be
locked shared.
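
A minimal sketch of that change, assuming the usual kernel atomic helper
atomic_add_int() and a simplified stand-in for struct vm_object:

    #include <sys/types.h>
    #include <machine/atomic.h>     /* atomic_add_int() */

    struct vm_object_sketch {       /* stand-in; the real vm_object has
                                     * many more fields */
            u_int generation;
    };

    static void
    object_bump_generation(struct vm_object_sketch *obj)
    {
            /*
             * Was effectively: obj->generation++;
             * That is a lost-update race once the object can be held
             * with only a shared lock by several threads at once.
             */
            atomic_add_int(&obj->generation, 1);
    }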

VKERNEL

* Unconditionally save the FP state after returning from VMSPACE_CTL_RUN.
This solves a severe FP corruption bug in the vkernel due to calls it
makes into libc (which uses %xmm registers all over the place).

This is not a complete fix. We need a formal userspace/kernelspace FP
abstraction. Right now the vkernel doesn't have a kernelspace FP
abstraction so if a kernel thread switches preemptively bad things
happen.

* The kernel tracks and locks pv_entry structures to interlock pte's.
The vkernel never caught up, and does not really have a pv_entry or
placemark mechanism. The vkernel's pmap really needs a complete
re-port from the real-kernel pmap code. Until then, we use poor hacks.

* Use the vm_page's spinlock to interlock pte changes.

* Make sure that PG_WRITEABLE is set or cleared with the vm_page
spinlock held.
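
A minimal sketch of the rule, using the existing vm_page spinlock and flag
helpers; error handling and the surrounding pte manipulation are omitted:

    #include <vm/vm_page.h>

    /*
     * PG_WRITEABLE transitions only while the page's spinlock is held,
     * so concurrent pte updates in the vkernel pmap see a consistent
     * flag.
     */
    static void
    page_set_writeable(struct vm_page *m)
    {
            vm_page_spin_lock(m);
            vm_page_flag_set(m, PG_WRITEABLE);
            vm_page_spin_unlock(m);
    }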

* Have pmap_clearbit() acquire the pmobj token for the pmap in the
iteration. This appears to be necessary, currently, as most of the
rest of the vkernel pmap code also uses the pmobj token.

* Fix bugs in the vkernel's swapu32() and swapu64().

* Change pmap_page_lookup() and pmap_unwire_pgtable() to fully busy
the page. Note however that a page table page is currently never
soft-busied. Also adjust other vkernel code that busies a page table page.

* Fix some sillycode in a pmap->pm_ptphint test.

* Don't inherit e.g. PG_M from the previous pte when overwriting it
with a pte of a different physical address.

* Change the vkernel's pmap_clear_modify() function to clear VPTE_RW
(which also clears VPTE_M), and not just VPTE_M. Formally we want
the vkernel to be notified when a page becomes modified and it won't
be unless we also clear VPTE_RW and force a fault. <--- I may change
this back after testing.
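
A minimal sketch of that behavior, assuming the vkernel's VPTE_RW/VPTE_M
vpte bits and the kernel's atomic_clear_long() helper are in scope; the
pte locking around the call is omitted:

    #include <sys/types.h>

    /* Assumes VPTE_RW, VPTE_M and atomic_clear_long() are in scope. */
    static void
    vpte_clear_modify(volatile u_long *ptep)
    {
            /*
             * Clearing VPTE_RW as well as VPTE_M forces the next write
             * to fault, so the vkernel is told when the page is modified
             * again instead of silently missing the transition.
             */
            atomic_clear_long(ptep, VPTE_RW | VPTE_M);
    }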

* Wrap pmap_replacevm() with a critical section.

* Scrap the old grow_stack() code. vm_fault() and vm_fault_page() handle
it (vm_fault_page() just now got the ability).

* Properly flag VM_FAULT_USERMODE.


# 79f2da03 15-Jul-2016 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code

* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation

* Remove the old lwkt_ipi-based per-page invalidation code.

* Include Xinvltlb interrupts in the V_IPI statistics counter
(so they show up in systat -pv 1).

* Add loop counters to detect and log possible endless loops.

* (Fix single_apic_ipi_passive() but note that this function is currently
not used. Interrupts must be hard-disabled when checking icr_lo).

* NEW INVALIDATION MECHANISM

The new invalidation mechanism is primarily enclosed in mp_machdep.c and
pmap_inval.c. Supply new all-in-one rollup functions which include the
*ptep contents adjustment, instead of prior piecemeal functions.

The new mechanism uses Xinvltlb for both full-tlb and per-page
invalidations. This interrupt ignores critical sections (that is,
will operate even if kernel code is in a critical section), which
significantly improves the latency and stability of our pmap pte
invalidation support functions.

For example, prior to these changes the invalidation code used the
lwkt_ipiq paths, which are subject to critical sections and could result
in long stalls across substantially ALL cpus when one cpu was in a long
cpu-bound critical section.

* NEW SMP_INVLTLB() OPTIMIZATION

smp_invltlb() always used Xinvltlb, and it still does. However, the
code now avoids IPIing idle cpus, instead flagging them to issue the
cpu_invltlb() call when they wake up.

To make this work the idle code must temporarily enter a critical section
so 'normal' interrupts do not run until it has a chance to check and act
on the flag. This will slightly increase interrupt latency on an idle
cpu.

This change significantly reduces smp_invltlb() overhead by avoiding
having to pull idle cpus out of their high-latency/low-power state, and
it keeps the high wake-up latency of those cpus from stalling the cpu
requesting the invalidation.
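
A minimal sketch of the mechanism with stand-in names (the real code keys
off a cpumask of idle cpus inside smp_invltlb() and the per-cpu idle loop):

    void    cpu_invltlb(void);      /* local full-TLB flush */
    void    crit_enter(void);
    void    crit_exit(void);

    #define NCPU_SKETCH     256     /* stand-in for the real cpu limit */

    static volatile int tlb_flush_pending[NCPU_SKETCH];

    /* Called from the idle loop before dropping into a low-power wait. */
    static void
    idle_check_pending_invltlb(int mycpu)
    {
            /*
             * The critical section defers normal interrupt processing
             * until the pending flag has been checked and acted on.
             */
            crit_enter();
            if (tlb_flush_pending[mycpu]) {
                    tlb_flush_pending[mycpu] = 0;
                    cpu_invltlb();
            }
            crit_exit();
    }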

* Remove unnecessary calls to smp_invltlb(). It is not necessary to call
this function when a *ptep is transitioning from 0 to non-zero. This
significantly cuts down on smp_invltlb() traffic under load.
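
The reasoning, sketched with illustrative names: a pte that was previously
zero cannot be cached in any TLB, so there is nothing to shoot down.

    #include <stdint.h>

    void    smp_invltlb(void);      /* cross-cpu TLB shootdown */

    static void
    set_pte_sketch(volatile uint64_t *ptep, uint64_t npte)
    {
            uint64_t opte = *ptep;  /* simplified; the real code swaps
                                     * the pte atomically */

            *ptep = npte;
            if (opte != 0)
                    smp_invltlb();  /* old translation may be cached */
            /* 0 -> non-zero: never cached, no shootdown needed */
    }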

* Remove a bunch of unused code in these paths.

* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down
counters which do one stack backtrace when they hit 0.

TIMING TESTS

No appreciable differences with the new code other than feeling smoother.

mount_tmpfs dummy /usr/obj

On monster (4-socket, 48-core):
time make -j 50 buildworld
BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
AFTER: 7682.598u 4467.224s 15:47.87 1281.8%

time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 927.608u 254.626s 1:36.01 1231.3%
AFTER: 531.124u 204.456s 1:25.99 855.4%

On 2 x E5-2620 (2-socket, 32-core):
time make -j 50 buildworld
BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
AFTER: 5694.573u 2280.078s 10:34.96 1255.9%

time make -j 50 nativekernel NO_MODULES=TRUE
BEFORE: 431.338u 84.458s 0:54.71 942.7%
AFTER: 414.962u 92.312s 0:54.75 926.5%
(time mostly spent in the mkdep pass and on the final link)

Memory thread tests, 64 threads each allocating memory.

BEFORE: 3.1M faults/sec
AFTER: 3.1M faults/sec.


# a86ce0cd 20-Sep-2013 Matthew Dillon <dillon@apollo.backplane.com>

hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree

* This merge contains work primarily by Mihai Carabas, with some misc
fixes also by Matthew Dillon.

* Special note on GSOC core

This is, needless to say, a huge amount of work compressed down into a
few paragraphs of comments. Adds the pc64/vmm subdirectory and tons
of stuff to support hardware virtualization in guest-user mode, plus
the ability for programs (vkernels) running in this mode to make normal
system calls to the host.

* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c
which vectors through a structure to machine-specific implementations.

vmm_guest_ctl_args()
vmm_guest_sync_addr_args()

vmm_guest_ctl_args() - bootstrap VMM and EPT modes. Copydown the original
user stack for EPT (since EPT 'physical' addresses cannot reach that far
into the backing store represented by the process's original VM space).
Also installs the GUEST_CR3 for the guest using parameters supplied by
the guest.

vmm_guest_sync_addr_args() - A host helper function that the vkernel can
use to invalidate page tables on multiple real cpus. This is a lot more
efficient than having the vkernel try to do it itself with IPI signals
via cpusync*().

* Add Intel VMX support to the host infrastructure. Again, tons of work
compressed down into a one paragraph commit message. Intel VMX support
added. AMD SVM support is not part of this GSOC and not yet supported
by DragonFly.

* Remove PG_* defines for PTE's and related mmu operations. Replace with
a table lookup so the same pmap code can be used for normal page tables
and also EPT tables.
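
A minimal sketch of the table-lookup idea; the PG_*_IDX indices follow the
convention the pmap code uses, but the index values and struct layout here
are illustrative only:

    #include <stdint.h>

    #define PG_V_IDX        0       /* placeholder index values */
    #define PG_RW_IDX       1

    struct pmap_sketch {
            uint64_t pmap_bits[2];  /* normal pte bits or EPT bits,
                                     * chosen when the pmap is set up */
    };

    static uint64_t
    make_pte(struct pmap_sketch *pmap, uint64_t pa, int writable)
    {
            uint64_t pte = pa | pmap->pmap_bits[PG_V_IDX];

            if (writable)
                    pte |= pmap->pmap_bits[PG_RW_IDX];
            return (pte);
    }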

* Also include X86_PG_V defines specific to normal page tables for a few
situations outside the pmap code.

* Adjust DDB to disassemble VMX-related (Intel) instructions.

* Add infrastructure to exit1() to deal with related structures.

* Optimize pfind() and pfindn() to remove the global token when looking
up the current process's PID (Matt)

* Add support for EPT (double layer page tables). This primarily required
adjusting the pmap code to use a table lookup to get the PG_* bits.

Add an indirect vector for copyin, copyout, and other user address space
copy operations to support manual walks when EPT is in use.

A multitude of system calls which manually looked up user addresses via
the vm_map now need a VMM layer call to translate EPT.

* Remove the MP lock from trapsignal() use cases in trap().

* (Matt) Add pthread_yield()s in most spin loops to help situations where
the vkernel is running on more cpus than the host has, and to help with
scheduler edge cases on the host.

* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page()
uses to try to shortcut operations and avoid locks. Implement it for
pc64. This function checks whether the page is already faulted in as
requested by looking up the PTE. If not, it returns NULL and the
full-blown vm_fault_page() code continues running.
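
A minimal sketch of that fast path, with stand-in helpers and illustrative
pte bit values; the real function also handles busy/wired state:

    #include <stdint.h>
    #include <stddef.h>

    struct pmap;
    struct vm_page;

    /* Stand-ins for the pmap's pte lookup and PHYS_TO_VM_PAGE(). */
    uint64_t *pte_lookup(struct pmap *pmap, uint64_t va);
    struct vm_page *phys_to_page(uint64_t pa);

    #define PTE_V           0x0000000000000001ULL  /* illustrative */
    #define PTE_RW          0x0000000000000002ULL
    #define PTE_FRAME       0x000ffffffffff000ULL

    static struct vm_page *
    fault_page_quick_sketch(struct pmap *pmap, uint64_t va, int want_write)
    {
            uint64_t *ptep = pte_lookup(pmap, va);

            if (ptep == NULL || (*ptep & PTE_V) == 0)
                    return (NULL);          /* not mapped: slow path */
            if (want_write && (*ptep & PTE_RW) == 0)
                    return (NULL);          /* needs full fault handling */
            return (phys_to_page(*ptep & PTE_FRAME));
    }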

* (Matt) Remove the MP lock from most of the vkernel's trap() code.

* (Matt) Use a shared spinlock when possible for certain critical paths
related to the copyin/copyout path.


# da673940 17-Aug-2009 Jordan Gordeev <jgordeev@dir.bg>

Add platform vkernel64.