95270b7e | 01-Feb-2017 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Many fixes for vkernel support, plus a few main kernel fixes
REAL KERNEL
* The big enchilada is that the main kernel's thread switch code has a small timing window where it clears the PM_ACTIVE bit for the cpu while switching between two threads. However, it *ALSO* checks and avoids loading the %cr3 if the two threads have the same pmap.
This results in a situation where an invalidation on the pmap from another cpu may not have visibility to the cpu doing the switch, and yet the cpu doing the switch also decides not to reload %cr3 and so does not invalidate the TLB either. The result is a stale TLB and bad things happen.
For now just unconditionally load %cr3 until I can come up with code to handle the case.
This bug is very difficult to reproduce on a normal system; it requires a multi-threaded program doing nasty things (munmap, etc) on one cpu while another thread is switching to a third thread on some other cpu.
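For illustration only, a minimal self-contained C sketch of the tradeoff described above (the structure and names are hypothetical, not the kernel's actual swtch.s/thread code): skipping the %cr3 reload also skips the implicit TLB flush, which is exactly the window that lets a remote invalidation be missed.

    #include <stdint.h>

    struct thread_sketch {
        void     *td_pmap;         /* hypothetical: pmap backing the thread    */
        uint64_t  td_cr3;          /* hypothetical: page-table root for thread */
    };

    static void
    load_cr3_sk(uint64_t cr3)
    {
        (void)cr3;                 /* stand-in for the mov-to-%cr3 instruction */
    }

    static void
    switch_thread_sk(struct thread_sketch *otd, struct thread_sketch *ntd)
    {
    #if 0
        /* Racy optimization: skipping the reload also skips the implicit TLB
         * flush, so an invalidation from another cpu can be missed while the
         * PM_ACTIVE bit is momentarily clear. */
        if (otd->td_pmap != ntd->td_pmap)
            load_cr3_sk(ntd->td_cr3);
    #else
        /* Interim fix from this commit: always reload %cr3. */
        (void)otd;
        load_cr3_sk(ntd->td_cr3);
    #endif
    }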
* KNOTE after handling the vkernel trap in postsig() instead of before.
* Change the kernel's pmap_inval_smp() code to take a 64-bit npgs argument instead of a 32-bit npgs argument. This fixes situations that crop up when a process uses more than 16TB of address space (2^32 4KiB pages is exactly 16TB, so a 32-bit page count overflows beyond that).
* Add an lfence to the pmap invalidation code that I think might be needed.
* Handle some wrap/overflow cases in pmap_scan() related to the use of large address spaces.
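As a hedged illustration of the wrap hazard (generic C, not the actual pmap_scan() code): when the end of a scan range sits near the top of the address space, the per-page increment can wrap to 0 and the loop never terminates unless it is guarded.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE_SK 4096ULL

    /* Counts pages in [sva, eva) without letting "va += page" wrap past
     * the top of the address space and loop forever. */
    static uint64_t
    count_pages_sk(uint64_t sva, uint64_t eva)
    {
        uint64_t va = sva;
        uint64_t n = 0;

        while (va < eva) {
            ++n;
            if (va > UINT64_MAX - PAGE_SIZE_SK)
                break;                     /* increment would wrap to 0 */
            va += PAGE_SIZE_SK;
        }
        return n;
    }

    int
    main(void)
    {
        /* A 4-page range ending at the very top of the 64-bit space. */
        uint64_t n = count_pages_sk(UINT64_MAX - 4 * PAGE_SIZE_SK + 1, UINT64_MAX);
        printf("%llu pages\n", (unsigned long long)n);
        return 0;
    }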
* Fix an unnecessary invltlb in pmap_clearbit() for unmanaged PTEs.
* Test PG_RW after locking the pv_entry to handle potential races.
* Add bio_crc to struct bio. This field is only used for debugging for now but may come in useful later.
* Add some global debug variables in the pmap_inval_smp() and related paths. Refactor the npgs handling.
* Load the tsc_target field after waiting for completion of the previous invalidation op instead of before. Also add a conservative mfence() in the invalidation path before loading the info fields.
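A minimal sketch of the intended ordering, using C11 atomics in place of the kernel's primitives; the field and function names are hypothetical, not the real pmap_inval info structure.

    #include <stdatomic.h>
    #include <stdint.h>

    struct inval_slot_sk {
        atomic_int busy;           /* nonzero while an invalidation is in flight */
        uint64_t   tsc_target;     /* per-op deadline field                      */
        uint64_t   va;
        uint64_t   npgs;
    };

    static void
    start_invalidation_sk(struct inval_slot_sk *slot, uint64_t va, uint64_t npgs,
                          uint64_t tsc_target)
    {
        /* Wait for the previous invalidation op to complete first. */
        while (atomic_load_explicit(&slot->busy, memory_order_acquire) != 0)
            ;                      /* the kernel would cpu_pause() here */

        /* Conservative full fence before loading the info fields. */
        atomic_thread_fence(memory_order_seq_cst);

        slot->va = va;
        slot->npgs = npgs;
        slot->tsc_target = tsc_target;     /* set *after* the wait, not before */

        atomic_store_explicit(&slot->busy, 1, memory_order_release);
    }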
* Remove the global pmap_inval_bulk_count counter.
* Adjust swtch.s to always reload the user process %cr3, with an explanation. FIXME LATER!
* Add some test code to vm/swap_pager.c which double-checks that the page being paged out does not get corrupted during the operation. This code is #if 0'd.
* We must hold an object lock around the swp_pager_meta_ctl() call in swp_pager_async_iodone(). I think.
* Reorder when PG_SWAPINPROG is cleared. Finish the I/O before clearing the bit.
* Change the vm_map_growstack() API to pass a vm_map in instead of curproc.
* Use atomic ops for vm_object->generation counts, since objects can be locked shared.
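A small sketch of the idea, with C11 atomics standing in for the kernel's atomic ops: with shared object locks two lockers can race on a plain generation++ and lose an update, so the bump must be atomic.

    #include <stdatomic.h>

    struct vm_object_sketch {
        atomic_uint generation;    /* stand-in for vm_object->generation */
    };

    /* Equivalent in spirit to atomic_add_int(&object->generation, 1). */
    static void
    object_bump_generation_sk(struct vm_object_sketch *obj)
    {
        atomic_fetch_add_explicit(&obj->generation, 1, memory_order_relaxed);
    }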
VKERNEL
* Unconditionally save the FP state after returning from VMSPACE_CTL_RUN. This solves a severe FP corruption bug in the vkernel due to calls it makes into libc (which uses %xmm registers all over the place).
This is not a complete fix. We need a formal userspace/kernelspace FP abstraction. Right now the vkernel doesn't have a kernelspace FP abstraction so if a kernel thread switches preemptively bad things happen.
* The kernel tracks and locks pv_entry structures to interlock pte's. The vkernel never caught up, and does not really have a pv_entry or placemark mechanism. The vkernel's pmap really needs a complete re-port from the real-kernel pmap code. Until then, we use poor hacks.
* Use the vm_page's spinlock to interlock pte changes.
* Make sure that PG_WRITEABLE is set or cleared with the vm_page spinlock held.
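Conceptual sketch only (a POSIX mutex standing in for the vm_page spinlock; names are illustrative): the pte change and the PG_WRITEABLE transition are made under the same per-page lock so they cannot be observed out of sync.

    #include <pthread.h>
    #include <stdint.h>

    #define PG_WRITEABLE_SK 0x0001u        /* illustrative flag bit */

    struct vm_page_sketch {
        pthread_mutex_t spin;              /* stand-in for the vm_page spinlock */
        uint32_t        flags;
    };

    /* Both the pte rewrite and the PG_WRITEABLE transition happen under the
     * same per-page lock, so no other cpu can see them out of sync. */
    static void
    page_make_writeable_sk(struct vm_page_sketch *m, uint64_t *ptep, uint64_t npte)
    {
        pthread_mutex_lock(&m->spin);
        *ptep = npte;
        m->flags |= PG_WRITEABLE_SK;
        pthread_mutex_unlock(&m->spin);
    }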
* Have pmap_clearbit() acquire the pmobj token for the pmap in the iteration. This appears to be necessary, currently, as most of the rest of the vkernel pmap code also uses the pmobj token.
* Fix bugs in the vkernel's swapu32() and swapu64().
* Change pmap_page_lookup() and pmap_unwire_pgtable() to fully busy the page. Note however that a page table page is currently never soft-busied. Also adjust other vkernel code that busies a page table page.
* Fix some sillycode in a pmap->pm_ptphint test.
* Don't inherit e.g. PG_M from the previous pte when overwriting it with a pte of a different physical address.
* Change the vkernel's pmap_clear_modify() function to clear VPTE_RW (which also clears VPTE_M), and not just VPTE_M. Formally we want the vkernel to be notified when a page becomes modified and it won't be unless we also clear VPTE_RW and force a fault. <--- I may change this back after testing.
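A tiny sketch of the semantic change (bit values are made up for illustration, not the real VPTE_* values): clearing only VPTE_M leaves the mapping writable so the vkernel never sees another fault; clearing VPTE_RW as well forces a write fault on the next store, which is the notification we want.

    #include <stdint.h>

    #define VPTE_RW_SK 0x0002ULL   /* illustrative, not the real VPTE_RW value */
    #define VPTE_M_SK  0x0040ULL   /* illustrative, not the real VPTE_M value  */

    /* New behavior: drop both bits so the next store re-faults and the
     * vkernel is told the page became modified again. */
    static uint64_t
    clear_modify_sk(uint64_t vpte)
    {
        return vpte & ~(VPTE_RW_SK | VPTE_M_SK);
    }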
* Wrap pmap_replacevm() with a critical section.
* Scrap the old grow_stack() code. vm_fault() and vm_fault_page() handle it (vm_fault_page() just now got the ability).
* Properly flag VM_FAULT_USERMODE.

79f2da03 | 15-Jul-2016 | Matthew Dillon <dillon@apollo.backplane.com>
kernel - Refactor Xinvltlb and the pmap page & global tlb invalidation code
* Augment Xinvltlb to handle both TLB invalidation and per-page invalidation
* Remove the old lwkt_ipi-based per-page invalidation code.
* Include Xinvltlb interrupts in the V_IPI statistics counter (so they show up in systat -pv 1).
* Add loop counters to detect and log possible endless loops.
* (Fix single_apic_ipi_passive() but note that this function is currently not used. Interrupts must be hard-disabled when checking icr_lo).
* NEW INVALIDATION MECHANISM
The new invalidation mechanism is primarily enclosed in mp_machdep.c and pmap_inval.c. Supply new all-in-one rollup functions which include the *ptep contents adjustment, instead of prior piecemeal functions.
The new mechanism uses Xinvltlb for both full-tlb and per-page invalidations. This interrupt ignores critical sections (that is, will operate even if kernel code is in a critical section), which significantly improves the latency and stability of our pmap pte invalidation support functions.
For example, prior to these changes the invalidation code uses the lwkt_ipiq paths which are subject to critical sections and could result in long stalls across substantially ALL cpus when one cpu was in a long cpu-bound critical section.
* NEW SMP_INVLTLB() OPTIMIZATION
smp_invltlb() always used Xinvltlb, and it still does. However the code now avoids IPIing idle cpus, instead flagging them to issue the cpu_invltlb() call when they wake up.
To make this work the idle code must temporarily enter a critical section so 'normal' interrupts do not run until it has a chance to check and act on the flag. This will slightly increase interrupt latency on an idle cpu.
This change significantly improves smp_invltlb() overhead by avoiding having to pull idle cpus out of their high-latency/low-power state, and it keeps that wake-up latency from inflating the latency of the smp_invltlb() operation itself.
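A rough sketch of the idle-cpu optimization described above (the flag and helper names are illustrative, not the kernel's actual per-cpu globals):

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic bool invltlb_deferred_sk;   /* per-cpu flag in the real code */

    static void cpu_invltlb_sk(void) { /* stand-in for cpu_invltlb() */ }
    static void crit_enter_sk(void)  { /* stand-in for crit_enter()  */ }
    static void crit_exit_sk(void)   { /* stand-in for crit_exit()   */ }

    /* Idle loop wakeup: briefly enter a critical section so a 'normal'
     * interrupt cannot run before the deferred flag is checked and acted on. */
    static void
    idle_wakeup_sk(void)
    {
        crit_enter_sk();
        if (atomic_exchange(&invltlb_deferred_sk, false))
            cpu_invltlb_sk();
        crit_exit_sk();
    }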
* Remove unnecessary calls to smp_invltlb(). It is not necessary to call this function when a *ptep is transitioning from 0 to non-zero. This significantly cuts down on smp_invltlb() traffic under load.
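The rule can be sketched as follows (illustrative C, not the kernel's actual helper): a TLB cannot cache a not-present pte, so only a previously-valid pte requires a cross-cpu invalidation when it is rewritten.

    #include <stdbool.h>
    #include <stdint.h>

    #define PG_V_SK 0x0001ULL      /* illustrative "valid" bit */

    /* Only a previously-valid pte can be cached in any TLB, so only that
     * case needs a cross-cpu invalidation when the pte is rewritten. */
    static bool
    needs_remote_invalidation_sk(uint64_t old_pte)
    {
        return (old_pte & PG_V_SK) != 0;
    }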
* Remove a bunch of unused code in these paths.
* Add machdep.report_invltlb_src and machdep.report_invlpg_src, down counters which do one stack backtrace when they hit 0.
TIMING TESTS
No appreciable differences with the new code other than feeling smoother.
mount_tmpfs dummy /usr/obj
On monster (4-socket, 48-core):
    time make -j 50 buildworld
        BEFORE: 7849.697u 4693.979s 16:23.07 1275.9%
        AFTER:  7682.598u 4467.224s 15:47.87 1281.8%
    time make -j 50 nativekernel NO_MODULES=TRUE
        BEFORE: 927.608u 254.626s 1:36.01 1231.3%
        AFTER:  531.124u 204.456s 1:25.99 855.4%
On 2 x E5-2620 (2-socket, 32-core):
    time make -j 50 buildworld
        BEFORE: 5750.042u 2291.083s 10:35.62 1265.0%
        AFTER:  5694.573u 2280.078s 10:34.96 1255.9%
    time make -j 50 nativekernel NO_MODULES=TRUE
        BEFORE: 431.338u 84.458s 0:54.71 942.7%
        AFTER:  414.962u 92.312s 0:54.75 926.5%
        (time mostly spent in the mkdep line and on the final link)
Memory thread tests, 64 threads each allocating memory:
    BEFORE: 3.1M faults/sec
    AFTER:  3.1M faults/sec

a86ce0cd | 20-Sep-2013 | Matthew Dillon <dillon@apollo.backplane.com>
hammer2 - Merge Mihai Carabas's VKERNEL/VMM GSOC project into the main tree
* This merge contains work primarily by Mihai Carabas, with some misc fixes also by Matthew Dillon.
* Special note on GSOC core
This is, needless to say, a huge amount of work compressed down into a few paragraphs of comments. Adds the pc64/vmm subdirectory and tons of stuff to support hardware virtualization in guest-user mode, plus the ability for programs (vkernels) running in this mode to make normal system calls to the host.
* Add system call infrastructure for VMM mode operations in kern/sys_vmm.c which vectors through a structure to machine-specific implementations.
vmm_guest_ctl_args() vmm_guest_sync_addr_args()
vmm_guest_ctl_args() - bootstrap VMM and EPT modes. Copydown the original user stack for EPT (since EPT 'physical' addresses cannot reach that far into the backing store represented by the process's original VM space). Also installs the GUEST_CR3 for the guest using parameters supplied by the guest.
vmm_guest_sync_addr_args() - A host helper function that the vkernel can use to invalidate page tables on multiple real cpus. This is a lot more efficient than having the vkernel try to do it itself with IPI signals via cpusync*().
* Add Intel VMX support to the host infrastructure. Again, tons of work compressed down into a one paragraph commit message. Intel VMX support added. AMD SVM support is not part of this GSOC and not yet supported by DragonFly.
* Remove PG_* defines for PTE's and related mmu operations. Replace with a table lookup so the same pmap code can be used for normal page tables and also EPT tables.
* Also include X86_PG_V defines specific to normal page tables for a few situations outside the pmap code.
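A minimal sketch of the table-lookup approach (indices, names and bit values are illustrative): each pmap carries a small table of bit encodings so the same pmap code can operate on normal x86 page tables and on EPT tables, whose bit layouts differ.

    #include <stdint.h>

    enum { PG_V_IDX_SK, PG_RW_IDX_SK, PG_BITS_SIZE_SK };

    struct pmap_sketch {
        uint64_t pmap_bits[PG_BITS_SIZE_SK];   /* per-pmap pte bit encodings */
    };

    /* Normal x86 page tables: present/write bits (values illustrative). */
    static const struct pmap_sketch pmap_x86_sk = {
        .pmap_bits = { [PG_V_IDX_SK] = 0x001, [PG_RW_IDX_SK] = 0x002 },
    };

    /* EPT tables encode validity/writability differently (values illustrative). */
    static const struct pmap_sketch pmap_ept_sk = {
        .pmap_bits = { [PG_V_IDX_SK] = 0x007, [PG_RW_IDX_SK] = 0x002 },
    };

    /* The shared pmap code tests bits through the table instead of PG_* macros. */
    static int
    pte_is_valid_sk(const struct pmap_sketch *pm, uint64_t pte)
    {
        return (pte & pm->pmap_bits[PG_V_IDX_SK]) != 0;
    }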
* Adjust DDB to disassemble VMX related (Intel) instructions.
* Add infrastructure to exit1() to deal with related structures.
* Optimize pfind() and pfindn() to remove the global token when looking up the current process's PID (Matt)
* Add support for EPT (double layer page tables). This primarily required adjusting the pmap code to use a table lookup to get the PG_* bits.
Add an indirect vector for copyin, copyout, and other user address space copy operations to support manual walks when EPT is in use.
A multitude of system calls which manually looked up user addresses via the vm_map now need a VMM layer call to translate EPT.
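A hedged sketch of the indirection (names are hypothetical; the real code lives in the copyin/copyout machinery): user-copy operations go through a per-context function pointer so an EPT-aware variant that walks the guest tables can be substituted when a vkernel runs under VMM.

    #include <stddef.h>
    #include <string.h>

    typedef int (*copyin_fn_sk)(const void *uaddr, void *kaddr, size_t len);

    /* Normal path: user addresses are directly reachable. */
    static int
    copyin_direct_sk(const void *uaddr, void *kaddr, size_t len)
    {
        memcpy(kaddr, uaddr, len);         /* placeholder for the usual copy */
        return 0;
    }

    /* EPT path: would first translate uaddr by walking the guest tables. */
    static int
    copyin_ept_sk(const void *uaddr, void *kaddr, size_t len)
    {
        /* ... manual EPT walk of uaddr would go here ... */
        memcpy(kaddr, uaddr, len);
        return 0;
    }

    /* Switched per process depending on whether it runs under VMM/EPT. */
    static copyin_fn_sk copyin_vector_sk = copyin_direct_sk;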
* Remove the MP lock from trapsignal() use cases in trap().
* (Matt) Add pthread_yield()s in most spin loops to help situations where the vkernel is running on more cpus than the host has, and to help with scheduler edge cases on the host.
* (Matt) Add a pmap_fault_page_quick() infrastructure that vm_fault_page() uses to try to shortcut operations and avoid locks. Implement it for pc64. This function checks whether the page is already faulted in as requested by looking up the PTE. If not it returns NULL and the full blown vm_fault_page() code continues running.
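A simplified sketch of the shortcut logic (bit names and the helper are hypothetical): if the pte already maps the page with the required protection the quick path succeeds, otherwise it bails out and the full vm_fault_page() code runs.

    #include <stdint.h>

    #define PG_V_SK  0x0001ULL     /* illustrative "valid" bit    */
    #define PG_RW_SK 0x0002ULL     /* illustrative "writable" bit */

    /* Returns nonzero when the existing pte already satisfies the fault,
     * so the caller can skip the heavyweight vm_fault_page() path. */
    static int
    fault_page_quick_ok_sk(uint64_t pte, int need_write)
    {
        if ((pte & PG_V_SK) == 0)
            return 0;              /* not faulted in yet: take the full path */
        if (need_write && (pte & PG_RW_SK) == 0)
            return 0;              /* would need a write fault: full path    */
        return 1;                  /* already mapped as required: shortcut   */
    }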
* (Matt) Remove the MP lock from most of the vkernel's trap() code.
* (Matt) Use a shared spinlock when possible for certain critical paths related to the copyin/copyout path.