#
7b00fbb4 |
| 25-Oct-2013 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Replace global vmobj_token with vmobj_tokens[] array
* Remove one of the two remaining major bottlenecks in the system, the global vmobj_token which is used to manage access to the vm_object_list. All VM object creation and deletion would get thrown into this list.
* Replace it with an array of 64 tokens (vmobj_tokens[]) and an array of 64 lists (vm_object_lists[]). Use a simple right-shift hash code to index the arrays (the hashing idea is sketched at the end of this entry).
* This reduces contention by a factor of 64 or so, which makes a big difference on multi-chip cpu systems. It won't be as noticeable on single-chip (e.g. 4-core/8-thread) systems.
* Rip out some of the Linux vmstats compat functions which were iterating the object list and replace them with the pcpu accumulator scan that was recently implemented for DragonFly vmstats.
* TODO: proc_token.
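A rough userland illustration of the hashing idea, assuming the hash is taken over the vm_object address; the constants, shift amount, and function name below are illustrative, not the kernel's actual values:
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define VMOBJ_HSIZE  64                /* one token + one list per bucket */
    #define VMOBJ_HMASK  (VMOBJ_HSIZE - 1)

    /*
     * Right-shift the object address to drop the always-zero allocator
     * alignment bits, then mask into the 64-entry arrays.  Each bucket has
     * its own token and its own list, so unrelated objects no longer fight
     * over one global lock.
     */
    static unsigned int
    vmobj_hash(const void *obj)
    {
        return (unsigned int)(((uintptr_t)obj >> 8) & VMOBJ_HMASK);
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 8; i++) {
            void *obj = malloc(512);   /* stand-in for a freshly allocated vm_object */
            printf("object %p -> bucket %u\n", obj, vmobj_hash(obj));
        }
        return 0;
    }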
|
#
2734d278 |
| 24-Oct-2013 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - more SMP optimizations in the VM system
* imgact_elf - drop the vm_object a little earlier in load_section(), and use a shared object lock when iterating ELF segments.
* When starting a vforked process use a shared process token to interlock the wait loop instead of an exclusive token. Also don't bother with the token if there's nothing to wait for.
* When forking, pre-assign lp2 thread's td_ucred.
* Remove the vp->v_object load check loop. It should not be possible for vp->v_object to change after being assigned as long as the vp is referenced.
* Replace most OBJ_DEAD tests with assertions that the flag is not set.
* Remove the VOLOCK/VOWANT vnode interlock. It shouldn't be possible for the vnode's object to change while the vnode is ref'd. This was a leftover from a long-ago time when vnodes were more persistent and could be recycled and race accessors.
This also removes vm_object_dead_sleep/wait and related code.
* When memory mapping a vnode object there is no need to formally hold and chain_wait the object. We can simply add a ref to it, because vnode objects cannot have backing chains.
* When deallocating a vm_object we can shortcut counts greater than 1 for OBJT_VNODE objects instead of counts greater than 3 (the fast path is sketched at the end of this entry).
* Optimize vnode_pager_alloc(), avoiding unnecessary locks. Keep the temporary vnode token for the moment.
* Optimize vnode_pager_reference(), removing all locks from the path.
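A minimal userland sketch of that deallocation fast path using C11 atomics; the helper name and the explicit threshold argument are invented for illustration and are not the kernel's actual API:
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Illustrative fast path: while ref_count stays above the threshold this
     * decrement cannot be the one that triggers termination or collapse work,
     * so it can be done with a bare compare-and-swap and no object lock.
     * Vnode objects have no backing chains, so their threshold can be 1;
     * DEFAULT/SWAP objects previously needed the higher threshold of 3.
     */
    static bool
    deallocate_fast(atomic_int *ref_count, int threshold)
    {
        int count = atomic_load(ref_count);

        while (count > threshold) {
            if (atomic_compare_exchange_weak(ref_count, &count, count - 1))
                return true;        /* decremented without taking the lock */
        }
        return false;               /* caller must lock and run the slow path */
    }

    int
    main(void)
    {
        atomic_int refs = 5;

        while (deallocate_fast(&refs, 1))
            ;
        printf("slow path needed at ref_count=%d\n", atomic_load(&refs));
        return 0;
    }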
|
#
501747bf |
| 12-Oct-2013 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Greatly improve concurrent forks and concurrent execs
* Rewrite all the vm_fault*() API functions to use a two-stage methodology which keeps track of whether a shared or exclusive lock is being used on fs.first_object and fs.object. For most VM faults a shared lock is sufficient, particularly under fork and exec circumstances.
If the shared lock is not sufficient the functions back down to an exclusive lock on either or both elements (the two-stage idea is sketched at the end of this entry).
* Implement shared chain locks for use by the above.
* kern_exec - exec_map_page() now attempts to access the page with a shared lock first, and backs down to an exclusive lock if the page is not conveniently available.
* vm_object ref-counting now uses atomic ops across the board. The acquisition call can operate with a shared object lock. The deallocate call will optimize decrementation of ref_count for values above 3 using an atomic op without needing any lock at all.
* vm_map_split() and vm_object_collapse() and associated functions are now smart about handling terminal (e.g. OBJT_VNODE) VM objects and will use a shared lock when possible.
* When creating new shadow chains in front of an OBJT_VNODE object, we no longer enter those objects onto the OBJT_VNODE object's shadow_head. That is, only DEFAULT and SWAP objects need to track who might be shadowing them. TODO: This code needs to be cleaned up a bit though.
This removes another exclusive object lock from the critical path.
* vm_page_grab() will use a shared object lock when possible.
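A minimal sketch of the two-stage idea using a pthread rwlock as a stand-in for the object lock; the structure and helper names are invented for illustration:
    #include <pthread.h>
    #include <stdbool.h>

    struct faultstate_sketch {
        pthread_rwlock_t *first_object_lock;   /* stand-in for the vm_object lock */
        bool              shared;              /* currently held shared? */
    };

    /* Stage one: take the cheap shared lock optimistically. */
    static void
    fault_lock_shared(struct faultstate_sketch *fs)
    {
        pthread_rwlock_rdlock(fs->first_object_lock);
        fs->shared = true;
    }

    /*
     * Stage two: if the fault turns out to need to modify the object, drop
     * the shared lock and retry exclusively.  Anything examined under the
     * shared lock must be re-validated because the object was briefly
     * unlocked.
     */
    static void
    fault_upgrade_exclusive(struct faultstate_sketch *fs)
    {
        if (fs->shared) {
            pthread_rwlock_unlock(fs->first_object_lock);
            pthread_rwlock_wrlock(fs->first_object_lock);
            fs->shared = false;
        }
    }

    int
    main(void)
    {
        pthread_rwlock_t lk = PTHREAD_RWLOCK_INITIALIZER;
        struct faultstate_sketch fs = { &lk, false };

        fault_lock_shared(&fs);         /* most faults never go past this */
        fault_upgrade_exclusive(&fs);   /* back down to exclusive only on demand */
        pthread_rwlock_unlock(&lk);
        return 0;
    }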
|
#
f2c2051e |
| 29-Jul-2013 |
Johannes Hofmann <johannes.hofmann@gmx.de> |
kernel: Port new device_pager interface from FreeBSD
Some parts implemented by François Tigeot and Matthew Dillon
|
#
6ed30774 |
| 18-Jul-2013 |
François Tigeot <ftigeot@wolfpond.org> |
pat: Make the API more compatible with FreeBSD
|
#
b524ca76 |
| 18-Jul-2013 |
Matthew Dillon <dillon@apollo.backplane.com> |
PAT work, mapdev_attr, kmem_alloc_attr
Partially based on work by Aggelos Economopoulos <aoiko@cc.ece.ntua.gr>
|
#
adddfd62 |
| 05-Jul-2013 |
Sascha Wildner <saw@online.de> |
kernel: Remove some #include duplicates in vfs/ and vm/
|
#
b004e484 |
| 24-Feb-2013 |
Sascha Wildner <saw@online.de> |
kernel/vm_object: Add debugvm_object_hold_maybe_shared() prototype.
|
#
ce94514e |
| 23-Feb-2013 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Implement much deeper use of shared VM object locks
* Use a shared VM object lock on terminal (and likely highly shared) OBJT_VNODE objects. For example, binaries in the system such as /bin/sh or /usr/bin/make.
This greatly improves fork/exec and related VM faults on concurrently executing binaries; parallel builds, for example, commonly exec hundreds of thousands of sh and make processes.
Nominal performance improves +50% to +100% under these conditions, and poudriere performance improves +200% to +300% during the depend stage.
* Formalize the shared VM object lock with a new API function, vm_object_lock_maybe_shared(), which determines whether a VM object meets the requirements for obtaining a shared lock (a hypothetical version of that test is sketched at the end of this entry).
* Adjust the vm_fault*() APIs to track whether the VM object is locked shared or exclusive on entry.
* Clarify that OBJ_ONEMAPPING is only applicable to OBJT_DEFAULT and OBJT_SWAP objects.
* Heavy work on the exec path. Somewhat lighter work on the exit path. Tons more work could be done.
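A hypothetical sketch of the kind of test such a function might perform; the field names and the exact conditions are guesses for illustration, not the real vm_object_lock_maybe_shared() logic:
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    enum objtype { OBJT_DEFAULT, OBJT_SWAP, OBJT_VNODE };

    struct vm_object_sketch {
        enum objtype             type;
        int                      ref_count;
        struct vm_object_sketch *backing_object;
    };

    /*
     * Illustrative qualification test: a terminal vnode object (e.g. the one
     * backing /bin/sh) that is likely mapped by many processes can be locked
     * shared; anything with a backing chain, or anything anonymous that a
     * fault may have to modify, falls back to the exclusive lock.
     */
    static bool
    qualifies_for_shared_lock(const struct vm_object_sketch *obj)
    {
        if (obj->type != OBJT_VNODE)
            return false;
        if (obj->backing_object != NULL)
            return false;
        return obj->ref_count > 1;      /* heuristic: heavily shared */
    }

    int
    main(void)
    {
        struct vm_object_sketch sh_text  = { OBJT_VNODE,  12, NULL };
        struct vm_object_sketch anon_map = { OBJT_DEFAULT, 1, NULL };

        printf("vnode text object: %s\n",
            qualifies_for_shared_lock(&sh_text) ? "shared" : "exclusive");
        printf("anonymous object:  %s\n",
            qualifies_for_shared_lock(&anon_map) ? "shared" : "exclusive");
        return 0;
    }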
|
#
9a0c03af |
| 06-Aug-2012 |
François Tigeot <ftigeot@wolfpond.org> |
kernel: add VM_OBJECT_LOCK/UNLOCK macros
|
#
921c891e |
| 13-Sep-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Implement segment pmap optimizations for x86-64
* Implement 2MB segment optimizations for x86-64. Any shared read-only or read-write VM object mapped into memory, including physical objects (so both sysv_shm and mmap), which is a multiple of the segment size and segment-aligned can be optimized (the qualification test is sketched at the end of this entry).
* Enable with sysctl machdep.pmap_mmu_optimize=1
Default is off for now. This is an experimental feature.
* It works as follows: A VM object which is large enough will, when VM faults are generated, store a truncated pmap (PD, PT, and PTEs) in the VM object itself.
VM faults whose vm_map_entry can be optimized will cause the PTE, PT, and also the PD (for now) to be stored in a pmap embedded in the VM_OBJECT, instead of in the process pmap.
The process pmap then creates an entry in the PD page table that points to the PT page table page stored in the VM_OBJECT's pmap.
* This removes nearly all page table overhead from fork()'d processes or even unrelated processes which massively share data via mmap() or sysv_shm. We still recommend using sysctl kern.ipc.shm_use_phys=1 (which is now the default), which also removes the PV entries associated with the shared pmap. However, with this optimization PV entries are no longer a big issue since they will not be replicated in each process, only in the common pmap stored in the VM_OBJECT.
* Features of this optimization:
* Number of PV entries is reduced to approximately the number of live pages and no longer multiplied by the number of processes separately mapping the shared memory.
* One process faulting in a page naturally makes the PTE available to all other processes mapping the same shared memory. The other processes do not have to fault that same page in.
* Page tables survive process exit and restart.
* Once page tables are populated and cached, any new process that maps the shared memory will take far fewer faults because each fault will bring in an ENTIRE page table. With Postgres and 64 clients, the VM fault rate was observed to drop from 1M faults/sec to less than 500 at startup, and during the run the fault rate, instead of declining steadily through the hundreds of thousands, dropped almost instantly to virtually zero.
* We no longer have to depend on sysv_shm to optimize the MMU.
* CPU caches will do a better job caching page tables since most of them are now themselves shared. Even when we invltlb, more of the page tables will be in the L1, L2, and L3 caches.
* EXPERIMENTAL!!!!!
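A small sketch of the qualification test implied above, assuming a 2MB segment; the constant and function names are illustrative only:
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SEG_SIZE  (2ULL * 1024 * 1024)   /* 2MB x86-64 segment, illustrative */
    #define SEG_MASK  (SEG_SIZE - 1)

    /*
     * A mapping can only reuse the page tables cached in the VM object's
     * embedded pmap if both its start address and its length fall on segment
     * boundaries, so that whole page-table pages are shared one-for-one.
     */
    static bool
    can_share_page_tables(uint64_t start, uint64_t size)
    {
        return (start & SEG_MASK) == 0 && (size & SEG_MASK) == 0 && size != 0;
    }

    int
    main(void)
    {
        printf("%d\n", can_share_page_tables(0x40000000ULL, 64 * SEG_SIZE)); /* 1 */
        printf("%d\n", can_share_page_tables(0x40001000ULL, 64 * SEG_SIZE)); /* 0: misaligned */
        return 0;
    }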
|
#
ad23467e |
| 23-May-2012 |
Sascha Wildner <saw@online.de> |
kernel: Remove some bogus casts to their own type.
|
#
a2ee730d |
| 02-Dec-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Refactor the vmspace locking code and several use cases
* Reorder the vnode ref/rele sequence in the exec path so p_textvp is left in a more valid state while being initialized.
* Remove the vm_exitingcnt test in exec_new_vmspace(). Release various resources unconditionally on the last exiting thread regardless of the state of exitingcnt. This just moves some of the resource releases out of the wait*() system call path and back into the exit*() path.
* Implement a hold/drop mechanic for vmspaces and use it in procfs_rwmem(), vmspace_anonymous_count(), vmspace_swap_count(), and various other places (the mechanic is sketched at the end of this entry).
This does a better job protecting the vmspace from deletion while various unrelated third parties might be trying to access it.
* Implement vmspace_free() for other code to call instead of them trying to call sysref_put() directly. Interlock with a vmspace_hold() so final termination processing always keys off the vm_holdcount.
* Implement vm_object_allocate_hold() and use it in a few places in order to allow OBJT_SWAP objects to be allocated atomically, so other third parties (like the swapcache cleaning code) can't wiggle their way in and access a partially initialized object.
* Reorder the vmspace_terminate() code and introduce some flags to ensure that resources are terminated at the proper time and in the proper order.
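A userland sketch of the hold/drop mechanic using C11 atomics; the structure and names are illustrative, and only the idea of keying final termination off the last hold follows the description above:
    #include <stdatomic.h>
    #include <stdio.h>

    struct vmspace_sketch {
        atomic_int vm_holdcount;    /* third-party accessors holding the vmspace */
        int        vm_dead;         /* set once termination has begun */
    };

    /* Third parties take a hold before touching the vmspace... */
    static void
    vmspace_hold_sketch(struct vmspace_sketch *vm)
    {
        atomic_fetch_add(&vm->vm_holdcount, 1);
    }

    /*
     * ...and drop it when done.  Whoever releases the last hold on a dying
     * vmspace runs final termination, so procfs_rwmem() or the anon/swap
     * counters can never observe a half-torn-down address space.
     */
    static void
    vmspace_drop_sketch(struct vmspace_sketch *vm)
    {
        if (atomic_fetch_sub(&vm->vm_holdcount, 1) == 1 && vm->vm_dead)
            printf("last hold dropped: run final termination\n");
    }

    int
    main(void)
    {
        struct vmspace_sketch vm = { 1, 0 };    /* owner's implicit hold */

        vmspace_hold_sketch(&vm);   /* e.g. procfs_rwmem() starts reading */
        vm.vm_dead = 1;             /* the owner exits in the meantime */
        vmspace_drop_sketch(&vm);   /* owner's drop: reader still active */
        vmspace_drop_sketch(&vm);   /* reader finishes: termination runs now */
        return 0;
    }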
|
#
609c9aae |
| 20-Nov-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix vm_object token deadlock (2)
* Files missed in original commit.
|
#
c9958a5a |
| 16-Nov-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Move VM objects from pool tokens to per-vm-object tokens
* Move VM objects from pool tokens to per-vm-object tokens.
* This fixes booting issues on i386 with vm.shared_fault=1 (pool tokens would sometimes coincide with the token used for kernel_object which causes problems on i386 due to the pmap code's use of kernel_map/kernel_object).
|
#
54341a3b |
| 15-Nov-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Greatly improve shared memory fault rate concurrency / shared tokens
This commit rolls up a lot of work to improve postgres database operations and the system in general. With these changes, pgbench -j 8 -c 40 on our 48-core Opteron monster runs at 140000+ tps, and the shm vm_fault rate hits 3.1M pps.
* Implement shared tokens. They work as advertised, with some caveats.
It is acceptable to acquire a shared token while you already hold the same token exclusively, but you will deadlock if you acquire an exclusive token while you hold the same token shared.
Currently exclusive tokens are not given priority over shared tokens so starvation is possible under certain circumstances.
* Create a critical code path in vm_fault() using the new shared token feature to quickly fault-in pages which already exist in the VM cache. pmap_object_init_pt() also uses the new feature.
This increases fault-in concurrency by a ridiculously huge amount, particularly on SHM segments (say when you have a large number of postgres clients). Scaling for large numbers of clients on large numbers of cores is significantly improved.
This also increases fault-in concurrency for MAP_SHARED file maps.
* Expand the breadn() and cluster_read() APIs. Implement breadnx() and cluster_readx(), which allow a getblk()'d bp to be passed in. If *bpp is not NULL a bp is being passed in, otherwise the routines call getblk() themselves (the calling convention is sketched at the end of this entry).
* Modify the HAMMER read path to use the new API. Instead of calling getcacheblk() HAMMER now calls getblk() and checks the B_CACHE flag. This gives getblk() a chance to regenerate a fully cached buffer from VM backing store without having to acquire any hammer-related locks, resulting in even faster operation.
* If kern.ipc.shm_use_phys is set to 2 the VM pages will be pre-allocated. This can take quite a while for a large map and also lock the machine up for a few seconds. Defaults to off.
* Reorder the smp_invltlb()/cpu_invltlb() combos in a few places, running cpu_invltlb() last.
* An invalidation interlock might be needed in pmap_enter() under certain circumstances, enable the code for now.
* vm_object_backing_scan_callback() was failing to properly check the validity of a vm_object after acquiring its token. Add the required check + some debugging.
* Make vm_object_set_writeable_dirty() a bit more cache friendly.
* The vmstats sysctl was scanning every process's vm_map (requiring a vm_map read lock to do so), which can stall for long periods of time when the system is paging heavily. Change the mechanic to a LWP flag which can be tested with minimal locking.
* Have the phys_pager mark the page as dirty too, to make sure nothing tries to free it.
* Remove the spinlock in pmap_prefault_ok(), since we do not delete page table pages it shouldn't be needed.
* Add a required cpu_ccfence() in pmap_inval.c. The code generated prior to this fix was still correct, and this makes sure it stays that way.
* Replace several manual wiring cases with calls to vm_page_wire().
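The *bpp convention for the expanded read APIs can be shown with a hypothetical stand-in; none of the names below are the real buffer-cache functions, this only illustrates the "pass a buffer in, or let the routine get one" shape:
    #include <stdio.h>
    #include <stdlib.h>

    struct buf_sketch { int cached; };

    /* Stand-in for getblk(): always hands back a buffer, cached or not. */
    static struct buf_sketch *
    getblk_sketch(void)
    {
        return calloc(1, sizeof(struct buf_sketch));
    }

    /*
     * Sketch of the expanded calling convention: if *bpp is non-NULL the
     * caller already did its own getblk() (and may have found the data fully
     * cached), so the routine reuses that buffer; otherwise it fetches one
     * itself, exactly like the classic breadn()/cluster_read() path.
     */
    static int
    readx_sketch(struct buf_sketch **bpp)
    {
        struct buf_sketch *bp = *bpp;

        if (bp == NULL)
            bp = getblk_sketch();
        if (!bp->cached) {
            /* ...issue the real read here... */
            bp->cached = 1;
        }
        *bpp = bp;
        return 0;
    }

    int
    main(void)
    {
        struct buf_sketch *bp = NULL;

        readx_sketch(&bp);          /* classic style: routine does the getblk */
        readx_sketch(&bp);          /* new style: caller passes its own bp */
        printf("cached=%d\n", bp->cached);
        free(bp);
        return 0;
    }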
|
#
e806bedd |
| 27-Oct-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix deep recursion in vm_object_collapse()
* vm_object_collapse() will loop but its backing_object sometimes needs to be deallocated as well and this can trigger another collapse against a different parent object.
* Introduce vm_object_dealloc_list and friends to collect a list of objects requiring deallocation so the caller can run the list in a way that avoids a deep recursion (the pattern is sketched below).
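A generic sketch of the collect-now, deallocate-later pattern; the list linkage and names are illustrative, not the actual vm_object_dealloc_list interface:
    #include <stdio.h>
    #include <stdlib.h>

    struct obj {
        struct obj *backing;        /* next object in the backing chain */
        struct obj *dealloc_next;   /* linkage on the pending-dealloc list */
    };

    /*
     * Instead of recursively deallocating backing objects (one stack frame
     * per chain element), push them onto a singly linked list and let the
     * caller drain it iteratively.
     */
    static void
    defer_dealloc(struct obj **listp, struct obj *o)
    {
        o->dealloc_next = *listp;
        *listp = o;
    }

    static void
    run_dealloc_list(struct obj **listp)
    {
        struct obj *o;

        while ((o = *listp) != NULL) {
            *listp = o->dealloc_next;
            if (o->backing)             /* may queue more work, still no recursion */
                defer_dealloc(listp, o->backing);
            free(o);
        }
    }

    int
    main(void)
    {
        struct obj *list = NULL;
        struct obj *a = calloc(1, sizeof(*a));
        struct obj *b = calloc(1, sizeof(*b));

        a->backing = b;             /* a shadows b */
        defer_dealloc(&list, a);
        run_dealloc_list(&list);    /* frees a, then b, with flat stack usage */
        printf("done\n");
        return 0;
    }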
Reported-by: juanfra
|
#
b12defdc |
| 18-Oct-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Major SMP performance patch / VM system, bus-fault/seg-fault fixes
This is a very large patch which reworks locking in the entire VM subsystem, concentrated on VM objects and the x86-64 pmap code. These fixes remove nearly all the spin lock contention for non-threaded VM faults and narrows contention for threaded VM faults to just the threads sharing the pmap.
Multi-socket many-core machines will see a 30-50% improvement in parallel build performance (tested on a 48-core opteron), depending on how well the build parallelizes.
As part of this work a long-standing problem on 64-bit systems where programs would occasionally seg-fault or bus-fault for no reason has been fixed. The problem was related to races between vm_fault, the vm_object collapse code, and the vm_map splitting code.
* Most uses of vm_token have been removed. All uses of vm_spin have been removed. These have been replaced with per-object tokens and per-queue (vm_page_queues[]) spin locks.
Note in particular that since we still have the page coloring code the PQ_FREE and PQ_CACHE queues are actually many queues, individually spin-locked, resulting in very excellent MP page allocation and freeing performance.
* Reworked vm_page_lookup() and vm_object->rb_memq. All (object,pindex) lookup operations are now covered by the vm_object hold/drop system, which utilize pool tokens on vm_objects. Calls now require that the VM object be held in order to ensure a stable outcome.
Also added vm_page_lookup_busy_wait(), vm_page_lookup_busy_try(), vm_page_busy_wait(), vm_page_busy_try(), and other API functions which integrate the PG_BUSY handling.
* Added OBJ_CHAINLOCK. Most vm_object operations are protected by the vm_object_hold/drop() facility which is token-based. Certain critical functions which must traverse backing_object chains use a hard-locking flag and lock almost the entire chain as it is traversed to prevent races against object deallocation, collapses, and splits.
The last object in the chain (typically a vnode) is NOT locked in this manner, so concurrent faults which terminate at the same vnode will still have good performance. This is important e.g. for parallel compiles which might be running dozens of the same compiler binary concurrently.
* Created a per vm_map token and removed most uses of vmspace_token.
* Removed the mp_lock in sys_execve(). It has not been needed in a while.
* Add kmem_lim_size() which returns approximate available memory (reduced by available KVM), in megabytes. This is now used to scale up the slab allocator cache and the pipe buffer caches to reduce unnecessary global kmem operations.
* Rewrote vm_page_alloc(), various bits in vm/vm_contig.c, the swapcache scan code, and the pageout scan code. These routines were rewritten to use the per-queue spin locks.
* Replaced the exponential backoff in the spinlock code with something a bit less complex and cleaned it up.
* Restructured the IPIQ func/arg1/arg2 array for better cache locality. Removed the per-queue ip_npoll and replaced it with a per-cpu gd_npoll, which is used by other cores to determine if they need to issue an actual hardware IPI or not. This reduces hardware IPI issuance considerably (and the removal of the decontention code reduced it even more).
* Temporarily removed the lwkt thread fairq code and disabled a number of features. These will be worked back in once we track down some of the remaining performance issues.
Temporarily removed the lwkt thread resequencer for tokens for the same reason. This might wind up being permanent.
Added splz_check()s in a few critical places.
* Increased the number of pool tokens from 1024 to 4001 and went to a prime-number mod algorithm to reduce overlaps (the selection is sketched at the end of this entry).
* Removed the token decontention code. This was a bit of an eyesore and while it did its job when we had global locks it just gets in the way now that most of the global locks are gone.
Replaced the decontention code with a fall back which acquires the tokens in sorted order, to guarantee that deadlocks will always be resolved eventually in the scheduler.
* Introduced a simplified spin-for-a-little-while function _lwkt_trytoken_spin() that the token code now uses rather than giving up immediately.
* The vfs_bio subsystem no longer uses vm_token and now uses the vm_object_hold/drop API for buffer cache operations, resulting in very good concurrency.
* Gave the vnode its own spinlock instead of sharing vp->v_lock.lk_spinlock, which fixes a deadlock.
* Adjusted all platform pmap.c's to handle the new main kernel APIs. The i386 pmap.c is still a bit out of date but should be compatible.
* Completely rewrote very large chunks of the x86-64 pmap.c code. The critical path no longer needs pmap_spin but pmap_spin itself is still used heavily, particularly in the pv_entry handling code.
A per-pmap token and per-pmap object are now used to serialize pmap access and vm_page lookup operations when needed.
The x86-64 pmap.c code now uses only vm_page->crit_count instead of both crit_count and hold_count, which fixes races against other parts of the kernel that use vm_page_hold().
_pmap_allocpte() mechanics have been completely rewritten to remove potential races. Much of pmap_enter() and pmap_enter_quick() has also been rewritten.
Many other changes.
* The following subsystems (and probably more) no longer use the vm_token or vmobj_token in critical paths:
x The swap_pager now uses the vm_object_hold/drop API instead of vm_token.
x mmap() and vm_map/vm_mmap in general now use the vm_object_hold/drop API instead of vm_token.
x vnode_pager
x zalloc
x vm_page handling
x vfs_bio
x umtx system calls
x vm_fault and friends
* Minor fixes to fill_kinfo_proc() to deal with process scan panics (ps) revealed by recent global lock removals.
* lockmgr() locks no longer support LK_NOSPINWAIT. Spin locks are unconditionally acquired.
* Replaced netif/e1000's spinlocks with lockmgr locks. The spinlocks were not appropriate owing to the large context they were covering.
* Misc atomic ops added
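A small sketch of the pool-token selection, assuming the change amounts to hashing the address modulo a prime instead of masking with a power of two; the shift and constants are illustrative:
    #include <stdint.h>
    #include <stdio.h>

    #define POOL_TOKENS  4001   /* prime; the old table was a 1024-entry power-of-two mask */

    /*
     * With a power-of-two table, structures allocated at a common alignment
     * tend to land in the same buckets.  Taking the address modulo a prime
     * spreads them out far more evenly, so unrelated objects are less likely
     * to share (and contend on) the same pool token.
     */
    static unsigned int
    pool_token_index(uintptr_t addr)
    {
        return (unsigned int)((addr >> 6) % POOL_TOKENS);
    }

    int
    main(void)
    {
        /*
         * Two addresses 64KB apart: a 1024-entry power-of-two mask would put
         * both in bucket 0; the prime modulus separates them.
         */
        printf("%u\n", pool_token_index(0x10000));
        printf("%u\n", pool_token_index(0x20000));
        return 0;
    }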
|
#
a31129d8 |
| 14-Jul-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Add debugging and attempt to fix vm_prefault issue
* Add debugging assertions and attempt to fix a race in the vm_prefault code when running through backing_object chains.
* The fix may be incomplete, we really need a way to determine whether any chain element has changed state during the scan. The generation count may be too excessive as it also covers vm_page insertions.
Reported-by: Peter Avalos <peter@theshell.com>
|
#
00db03f1 |
| 15-Jun-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Adjust vm_object->paging_in_progress to use refcount API
* Adjust vm_object->paging_in_progress to use refcount API
* Fixes races related to release / wait which could stall a process.
|
#
18a4c8dc |
| 14-Jun-2011 |
Venkatesh Srinivas <me@endeavour.zapto.org> |
kernel -- vm_object DEBUG_LOCKS: Record file/line of vm_object holds.
|
#
e42208e6 |
| 07-Jun-2011 |
Sascha Wildner <saw@online.de> |
<vm/vm_object.h>: Some little style cleanup.
|
#
b4460ab3 |
| 07-Jun-2011 |
Venkatesh Srinivas <me@endeavour.zapto.org> |
kernel -- vm_object locking: Interlock vm_object work in vm_fault.c and vm_map.c with per-object token. Handle NULL objects for _hold and _drop.
|
#
feea37dc |
| 29-Mar-2011 |
Venkatesh Srinivas <me@endeavour.zapto.org> |
kernel -- vm_object hold debugging should not panic if the debug array overflows
If the debug array overflows, we lose the ability to test for object drops when we never established a hold. However the system keeps running.
Suggested-by: dillon
|
#
cb443cbb |
| 27-Mar-2011 |
Venkatesh Srinivas <me@endeavour.zapto.org> |
kernel -- vm_object locking: DEBUG_LOCKS check for hold_wait vs hold deadlock
If a thread has a hold on a vm_object and enters hold_wait (via either vm_object_terminate or vm_object_collapse), it will wait forever for the hold count to hit 0. Record the threads holding an object in a per-object array.
|