#
b272101a |
| 30-Oct-2023 |
Aaron LI <aly@aaronly.me> |
Various minor whitespace cleanups
Accumulated along the way.
|
#
09195ea1 |
| 06-Nov-2023 |
Aaron LI <aly@aaronly.me> |
mbuf(9): Remove obsolete and unused 'kern.ipc.mbuf_wait' sysctl
This sysctl MIB has been obsolete and unused since the re-implementation of mbuf allocation using objcache(9) in commit 7b6f875 (year 2005). Remove this sysctl MIB.
Update the mbuf.9 manpage to clarify the 'how' argument, i.e., MGET()/m_get() etc. will not fail if how=M_WAITOK.
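A minimal sketch of the 'how' semantics described above, assuming the standard mbuf(9) allocators (the helper name is illustrative):

    /*
     * Sketch: with M_WAITOK the allocator sleeps until it can succeed,
     * so the NULL check is only needed for M_NOWAIT.
     */
    #include <sys/types.h>
    #include <sys/mbuf.h>

    static struct mbuf *
    alloc_hdr_example(int can_sleep)          /* illustrative helper */
    {
        struct mbuf *m;

        if (can_sleep) {
            m = m_gethdr(M_WAITOK, MT_DATA);  /* never returns NULL */
        } else {
            m = m_gethdr(M_NOWAIT, MT_DATA);  /* may fail; must check */
            if (m == NULL)
                return (NULL);
        }
        return (m);
    }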
|
Revision tags: v6.4.0, v6.4.0rc1, v6.5.0, v6.2.2, v6.2.1, v6.3.0, v6.0.1 |
|
#
45dd33f2 |
| 01-Jul-2021 |
Aaron LI <aly@aaronly.me> |
kernel: Detect NVMM hypervisor
The sysctl 'kern.vmm_guest' now reports 'nvmm' instead of 'unknown' when running under the NVMM hypervisor.
Suggested-by: swildner
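A small userland sketch of reading the result, assuming kern.vmm_guest is the string-valued sysctl described here:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        char buf[32];
        size_t len = sizeof(buf);

        /* Reports e.g. "nvmm", "kvm", or "unknown". */
        if (sysctlbyname("kern.vmm_guest", buf, &len, NULL, 0) == 0)
            printf("hypervisor: %s\n", buf);
        return 0;
    }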
|
Revision tags: v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1 |
|
#
880fb308 |
| 14-Feb-2020 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Offset the stathz systimer by 50% of the hz timer
* Offset the initial starting point of the stathz systimer by 50% of the hz timer, so they do not interfere with each other if they happen to be set to the same frequency.
* Change the default stathz frequency to hz + 1 (101hz) so it slides across the tick interval window.
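A toy model of the phase offset, with a hypothetical start_periodic() helper standing in for the systimer API:

    #include <stdint.h>

    /* Hypothetical periodic-timer setup helper, not a real kernel API. */
    extern void start_periodic(uint64_t first_us, uint64_t period_us);

    #define HZ_FREQ     100
    #define STATHZ_FREQ (HZ_FREQ + 1)   /* 101hz slides across the tick window */

    void
    start_clock_timers(uint64_t now_us)
    {
        uint64_t hz_period_us = 1000000 / HZ_FREQ;

        /* The hz timer starts on the current boundary... */
        start_periodic(now_us, hz_period_us);
        /* ...and stathz starts 50% of a tick later, so two timers at the
         * same frequency never fire at the same instant. */
        start_periodic(now_us + hz_period_us / 2, 1000000 / STATHZ_FREQ);
    }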
|
Revision tags: v5.6.3, v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0, v5.4.3 |
|
#
4837705e |
| 03-May-2019 |
Matthew Dillon <dillon@apollo.backplane.com> |
pthreads and kernel - change MAP_STACK operation
* Only allow new mmap()'s to intrude on ungrown MAP_STACK areas when MAP_TRYFIXED is specified. This was not working as intended before. Adjust the manual page to be more clear.
* Make kern.maxssiz (the maximum user stack size) visible via sysctl.
* Add kern.maxthrssiz, indicating the approximate space for placement of pthread stacks. This defaults to 128GB.
* The old libthread_xu stack code did use TRYFIXED and will work with the kernel changes, but change how it works to not assume that the user stack should suddenly be limited to the pthread stack size (~2MB).
Just use a normal mmap() now without TRYFIXED and a hint based on kern.maxthrssiz (defaults to 512GB), calculating a starting address hint that is (_usrstack - maxssiz - maxthrssiz).
* Adjust procfs to report MAP_STACK segments as 'STK'.
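A userland sketch of the placement logic under the description above; kern.maxssiz and kern.maxthrssiz are named in the commit, while kern.usrstack is assumed here:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/sysctl.h>
    #include <stddef.h>

    static unsigned long
    get_ul_sysctl(const char *name)
    {
        unsigned long val = 0;
        size_t len = sizeof(val);

        sysctlbyname(name, &val, &len, NULL, 0);
        return val;
    }

    void *
    alloc_thread_stack(size_t stksize)
    {
        unsigned long usrstack   = get_ul_sysctl("kern.usrstack");  /* assumed MIB */
        unsigned long maxssiz    = get_ul_sysctl("kern.maxssiz");
        unsigned long maxthrssiz = get_ul_sysctl("kern.maxthrssiz");
        void *hint = (void *)(usrstack - maxssiz - maxthrssiz);

        /* A plain mmap() with an address hint; no MAP_TRYFIXED needed. */
        return mmap(hint, stksize, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE | MAP_STACK, -1, 0);
    }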
|
Revision tags: v5.4.2, v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1 |
|
#
6a8bb22d |
| 05-May-2018 |
Sascha Wildner <saw@online.de> |
Fix a few typos across the tree.
|
Revision tags: v5.2.0, v5.3.0, v5.2.0rc, v5.0.2, v5.0.1, v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1 |
|
#
9c79791a |
| 14-Aug-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Increase flock() / posix-lock limit
* Change the scaling for kern.maxposixlocksperuid to be based on maxproc instead of maxusers.
* Increase the default cap to approximately 4 * maxproc. This can lead to what may seem to be large numbers, but certain applications might use posix locks heavily and overhead is low. We generally want the cap to be large enough that it never has to be overridden.
|
#
bb471226 |
| 12-Aug-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Change maxproc cap calculation
* Increase the calculation for the maxproc cap based on physical ram. This allows a machine with 128GB of ram to push maxproc past a million, though it should be noted that PIDs are only six digits, so for now a million processes is the practical limit.
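A back-of-envelope sketch of a RAM-scaled cap; the per-GB constant is invented for illustration, not the kernel's actual formula:

    #include <stdio.h>

    int
    main(void)
    {
        long physmem_gb = 128;              /* example machine */
        long per_gb     = 8192;             /* assumed processes per GB */
        long maxproc    = physmem_gb * per_gb;

        /* PIDs are six digits, so cap at just under a million. */
        if (maxproc > 999999)
            maxproc = 999999;
        printf("maxproc cap: %ld\n", maxproc);
        return 0;
    }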
|
#
e6b81333 |
| 12-Aug-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Fix bottlenecks that develop when many processes are running
* When a large number of processes or threads are running (in the tens of thousands or more), a number of O(n) or O(ncpus) bottlenecks can develop. These bottlenecks do not develop when only a few thousand threads are present.
By fixing these bottlenecks, and assuming kern.maxproc is autoconfigured or manually set high enough, DFly can now handle hundreds of thousands of active processes running, polling, sleeping, whatever.
Tested to around 400,000 discrete processes (no shared VM pages) on a 32-thread dual-socket Xeon system. Each process is placed in a 1/10 second sleep loop using umtx timeouts:
baseline (before changes) - the system bottlenecked starting at around the 30,000 process mark, eating all available cpu with a high IPI rate from hash collisions, and other unrelated user processes bogged down due to the scheduling overhead.
200,000 processes - system settles down to 45% idle with a low IPI rate.
220,000 processes - system 30% idle with a low IPI rate.
250,000 processes - system 0% idle with a low IPI rate.
300,000 processes - system 0% idle with a low IPI rate.
400,000 processes - the scheduler begins to bottleneck again after the 350,000 mark while the process test is still in its fork/exec loop.
Once all 400,000 processes are settled down, system behaves fairly well. 0% idle, modest IPI rate averaging 300 IPI/sec/cpu (due to hash collisions in the wakeup code).
* More work will be needed to better handle processes with massively shared VM pages.
It should also be noted that the system does a *VERY* good job allocating and releasing kernel resources during this test using discrete processes. It can kill 400,000 processes in a few seconds when I ^C the test.
* Change lwkt_enqueue()'s linear td_runq scan into a double-ended scan. This bottleneck does not arise when large numbers of processes are running in usermode, because typically only one user process per cpu will be scheduled to LWKT.
However, this bottleneck does arise when large numbers of threads are woken up in-kernel. While in-kernel, a thread schedules directly to LWKT. Round-robin operation tends to result in appends to the tail of the queue, so this optimization saves an enormous amount of cpu time when large numbers of threads are present.
* Limit ncallout to ~5 minutes' worth of ring. The calculation code is primarily designed to allocate less space on low-memory machines, but would otherwise cause an excessively-sized ring to be allocated on large-memory machines. 512MB was observed on a 32-way box.
* Remove vm_map->hint, which had basically stopped functioning in a useful manner. Add a new vm_map hinting mechanism that caches up to four (size, align) start addresses for vm_map_findspace(). This cache is used to quickly index into the linear vm_map_entry list before entering the linear search phase.
This fixes a serious bottleneck that arises from vm_map_findspace()'s linear scan of the vm_map_entry list when the kernel_map becomes fragmented, typically when the machine is managing a large number of processes or threads (in the tens of thousands or more).
This will also reduce overheads for processes with highly fragmented vm_maps.
* Dynamically size the action_hash[] array in vm/vm_page.c. This array is used to record blocked umtx operations. The limited size of the array could result in an excessive number of hash entries when a large number of processes/threads are present in the system. Again, the effect is noticed as the number of threads exceeds a few tens of thousands.
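An illustrative sketch of the four-slot (size, align) hint cache described above; the names and round-robin replacement policy are assumptions, not the kernel's actual code:

    #include <stddef.h>
    #include <stdint.h>

    struct findspace_hint {
        size_t    size;     /* request size this hint applies to */
        size_t    align;    /* request alignment */
        uintptr_t start;    /* cached address to begin the search at */
    };

    #define NHINTS 4
    static struct findspace_hint hints[NHINTS];
    static int next_slot;

    uintptr_t
    hint_lookup(size_t size, size_t align)
    {
        for (int i = 0; i < NHINTS; i++) {
            if (hints[i].size == size && hints[i].align == align)
                return hints[i].start;  /* skip most of the linear scan */
        }
        return 0;                       /* fall back to a full scan */
    }

    void
    hint_update(size_t size, size_t align, uintptr_t start)
    {
        hints[next_slot].size  = size;  /* simple round-robin replacement */
        hints[next_slot].align = align;
        hints[next_slot].start = start;
        next_slot = (next_slot + 1) % NHINTS;
    }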
|
Revision tags: v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc |
|
#
6de4cf8a |
| 23-Jan-2017 |
Matthew Dillon <dillon@apollo.backplane.com> |
vkernel - change hz default, optimize systimer
* Change the hz default to 50
* Refactor the vkernel's systimer code to reduce unnecessary signaling.
* Cleanup kern_clock.c a bit, including renaming HZ to HZ_DEFAULT to avoid confusion.
|
#
534ee349 |
| 28-Dec-2016 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Implement RLIMIT_RSS, Increase maximum supported swap
* Implement RLIMIT_RSS by forcing pages out to swap if a process's RSS exceeds the rlimit. Currently the algorithm used to choose the pages is fairly unsophisticated (we don't have the luxury of a per-process vm_page_queues[] array).
* Implement the swap_user_async sysctl, default off. This sysctl can be set to 1 to enable asynchronous paging in the RSS code. This is mostly for testing and is not recommended since it allows the process to eat memory more quickly than it can be paged out.
* Reimplement vm.swap_burst_read so the sysctl now specifies the number of pages that are allowed to be burst. Still disabled by default (will be enabled in a followup commit).
* Fix an overflow in the nswap_lowat and nswap_hiwat calculations.
* Refactor some of the pageout code to support synchronous direct paging, which the RSS code uses. The new code also implements a feature that will move clean pages to PQ_CACHE, making them immediately reallocatable.
* Refactor the vm_pageout_deficit variable, using atomic ops.
* Fix an issue in vm_pageout_clean() (originally part of the inactive scan) which prevented clustering from operating properly on write.
* Refactor kern/subr_blist.c and all associated code that uses it, increasing swblk_t from int32_t to int64_t and increasing the supported radix from 31 bits to 63 bits.
This increases the maximum supported swap from 2TB to some ungodly large value. Remember that, by default, space for up to 4 swap devices is preallocated so if you are allocating insane amounts of swap it is best to do it with four equal-sized partitions instead of one so kernel memory is efficiently allocated.
* There are two kernel data structures associated with swap. The blmeta structure which has approximately a 1:8192 ratio (ram:swap) and is pre-allocated up-front, and the swmeta structure whose KVA is reserved but not allocated.
The swmeta structure has a 1:341 ratio. It tracks swap assignments for pages in vm_object's. The kernel limits the number of structures to approximately half of physical memory, meaning that if you have a machine with 16GB of ram the maximum amount of swapped-out data you can support with that is 16/2*341 = 2.7TB. Not that you would actually want to eat half your ram to actually do that.
A large system with, say, 128GB of ram, would be able to support 128/2*341 = 21TB of swap. The ultimate limitation is the 512GB of KVM. The swap system can use up to 256GB of this so the maximum swap currently supported by DragonFly on a machine with > 512GB of ram is going to be 256/2*341 = 43TB. To expand this further would require some adjustments to increase the amount of KVM supported by the kernel.
* WARNING! swmeta is allocated via zalloc(). Once allocated, the memory can be reused for swmeta but cannot be freed for use by other subsystems. You should only configure as much swap as you are willing to reserve ram for.
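A userland sketch of exercising the new limit via the standard setrlimit(2) interface (the 64MB figure is arbitrary):

    #include <sys/types.h>
    #include <sys/resource.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct rlimit rl;

        rl.rlim_cur = 64UL * 1024 * 1024;   /* 64MB soft RSS limit */
        rl.rlim_max = 64UL * 1024 * 1024;
        if (setrlimit(RLIMIT_RSS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* Pages beyond the limit become candidates for forced pageout. */
        return 0;
    }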
|
Revision tags: v4.6.1, v4.6.0, v4.6.0rc2, v4.6.0rc, v4.7.0 |
|
#
2f0acc22 |
| 17-Jul-2016 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - Improve physio performance
* See http://apollo.backplane.com/DFlyMisc/nvme_sys03.txt
* Hash the pbuf system. This chops down spin-lock collisions at high transaction rates (>150K IOPS) by 1000x.
* Implement a pbuf with pre-allocated kernel memory that we copy into, avoiding page table manipulations and thus avoiding system-wide invltlb/invlpg IPIs.
* This increases NVMe IOPS tests with three cards from 150K-200K IOPS to 950K IOPS using physio (random read, 4K blocks, from urandom-filled partition, with many process threads, from 3 NVMe cards in parallel).
* Further adjustments to the vkernel build.
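An illustrative sketch of the hashing idea, splitting one contended pool into hashed buckets that would each carry their own lock; this is not the actual pbuf code:

    #include <stdint.h>

    struct pbuf;                    /* opaque for this sketch */

    #define PBUF_HSIZE 64           /* power-of-2 bucket count */

    struct pbuf_bucket {
        /* a per-bucket spinlock would live here instead of one global lock */
        struct pbuf *freelist;
    };

    static struct pbuf_bucket pbuf_hash[PBUF_HSIZE];

    static inline struct pbuf_bucket *
    pbuf_bucket(uintptr_t key)
    {
        /* cheap multiplicative hash; a collision only costs contention */
        return &pbuf_hash[((key * 0x9e3779b97f4a7c15ULL) >> 58) &
                          (PBUF_HSIZE - 1)];
    }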
|
#
d26086c2 |
| 12-Jun-2016 |
Sepherosa Ziehau <sephe@dragonflybsd.org> |
kern: Update virtual machine detection a bit
Obtained-from: FreeBSD (partial)
|
Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1, v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4, v4.0.3, v4.0.2, v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2, v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2 |
|
#
cb963641 |
| 05-Mar-2014 |
Sascha Wildner <saw@online.de> |
kernel: Adjust type of vmm_guest to enum vmm_guest_type.
Also rename VMM_LAST -> VMM_GUEST_LAST.
|
#
5a1223c1 |
| 04-Mar-2014 |
Sascha Wildner <saw@online.de> |
kernel: Add more detailed VM detection.
Previously, the kernel global 'vmm_guest' was either 0 or 1 depending on the VMM bit in the CPU features (which isn't set in all VMs).
This commit adds more detailed information by checking the emulated BIOS for known strings. The detected VMs include vkernel, which doesn't strictly fit into the category, but it shares enough similarities for this to be useful.
Also expose this information in a read-only sysctl (kern.vmm_guest).
The detection code was kind of adapted from FreeBSD (although their kern.vm_guest works differently).
Tested-by: tuxillo
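The commit matches BIOS strings; for comparison, a userland sketch of the CPUID.1 ECX bit-31 check that the older detection relied on (x86 inline asm):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t eax, ebx, ecx, edx;

        /* CPUID leaf 1: ECX bit 31 is the hypervisor-present bit,
         * which not all VMs set -- hence the BIOS-string fallback. */
        __asm__ __volatile__("cpuid"
            : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
            : "a" (1), "c" (0));
        printf("VMM bit: %s\n", (ecx & (1u << 31)) ? "set" : "clear");
        return 0;
    }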
|
Revision tags: v3.6.1, v3.6.0, v3.7.1, v3.6.0rc, v3.4.3 |
|
#
dc71b7ab |
| 31-May-2013 |
Justin C. Sherrill <justin@shiningsilence.com> |
Correct BSD License clause numbering from 1-2-4 to 1-2-3.
Apparently everyone's doing it: http://svnweb.freebsd.org/base?view=revision&revision=251069
Submitted-by: "Eitan Adler" <lists at eitanadler.com>
|
Revision tags: v3.4.2 |
|
#
2702099d |
| 06-May-2013 |
Justin C. Sherrill <justin@shiningsilence.com> |
Remove advertising clause from all that isn't contrib or userland bin.
By: Eitan Adler <lists@eitanadler.com>
|
Revision tags: v3.4.1, v3.4.0, v3.4.0rc, v3.5.0, v3.2.2, v3.2.1, v3.2.0, v3.3.0 |
|
#
a0264b24 |
| 16-Sep-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
Merge branches 'hammer2' and 'master' of ssh://crater.dragonflybsd.org/repository/git/dragonfly into hammer2
|
#
74d62460 |
| 15-Sep-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - remove bounds on buffer cache nbuf count for 64-bit
* Remove arbitrary 1GB buffer cache limitation
* Adjusted numerous 'int' fields to 'long'. Even though nbuf is not likely to exceed 2 billion buffers, byte calculations using the variable began overflowing so just convert that and various other variables to long.
* Make sure we don't blow out the temporary valloc() space in early boot due to nbuf being too large.
* Remove the bound on 'kern.nbuf' specifications in /boot/loader.conf as well.
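A minimal illustration of why the int-to-long conversion matters for the byte calculations mentioned above:

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        int nbuf = 2000000;                 /* 2M buffers */

        /* int arithmetic would overflow: 2000000 * 16384 > INT_MAX,
         * so promote to long before multiplying. */
        long bytes = (long)nbuf * 16384;
        printf("buffer KVA: %ld bytes (INT_MAX=%d)\n", bytes, INT_MAX);
        return 0;
    }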
|
Revision tags: v3.0.3 |
|
#
9437e5dc |
| 31-May-2012 |
Matthew Dillon <dillon@apollo.backplane.com> |
Merge branches 'hammer2' and 'master' of ssh://crater.dragonflybsd.org/repository/git/dragonfly into hammer2
|
#
08771751 |
| 24-May-2012 |
Venkatesh Srinivas <me@endeavour.zapto.org> |
kernel -- CLFLUSH support
* Introduce a kernel variable, 'vmm_guest', signifying whether the kernel is running in a virtual environment, such as KVM. This is set based on the CPUID2.VMM flag on real kernels and set automatically on virtual kernels.
* Introduce wrappers for CLFLUSH instructions.
* Provide a tunable, hw.clflush_enable, to autoenable CLFLUSH based on hardware detection (-1), disable it always (0), or enable it always (1).
Closes-bug: 2363
Reviewed-by: ftigeot@
From: David Shao, FreeBSD
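A sketch of a CLFLUSH wrapper in the spirit of this commit; the 64-byte line size is an assumption, and real code should take it from CPUID:

    /* Flush the cache line containing addr. */
    static inline void
    cpu_clflush(volatile void *addr)
    {
        __asm__ __volatile__("clflush %0" : "+m" (*(volatile char *)addr));
    }

    /* Flush every line in [base, base+len). */
    void
    flush_range(void *base, unsigned long len)
    {
        const unsigned long line = 64;      /* assumed cache line size */
        char *p = (char *)((unsigned long)base & ~(line - 1));
        char *end = (char *)base + len;

        for (; p < end; p += line)
            cpu_clflush(p);
    }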
|
Revision tags: v3.0.2, v3.0.1, v3.1.0, v3.0.0 |
|
#
86d7f5d3 |
| 26-Nov-2011 |
John Marino <draco@marino.st> |
Initial import of binutils 2.22 on the new vendor branch
Future versions of binutils will also reside on this branch rather than continuing to create new binutils branches for each new version.
|
#
50eff085 |
| 02-Nov-2011 |
Matthew Dillon <dillon@apollo.backplane.com> |
kernel - reformulate the maxusers auto-sizing calculation
* Reformulate the maxusers auto-sizing calculation, which is used as a basis for mbufs and mbuf cluster calculations. Base the values on limsize (basically the lower of KVM vs physical memory).
* Remove artificial limits.
* This basically affects x86-64 systems with > 4G of ram, greatly increasing the default maxusers value and related mbuf limits.
|
Revision tags: v2.12.0, v2.13.0, v2.10.1, v2.11.0, v2.10.0, v2.9.1, v2.8.2, v2.8.1, v2.8.0, v2.9.0, v2.6.3, v2.7.3, v2.6.2, v2.7.2, v2.7.1, v2.6.1, v2.7.0, v2.6.0, v2.5.1, v2.4.1, v2.5.0, v2.4.0 |
|
#
79634a66 |
| 12-Aug-2009 |
Matthew Dillon <dillon@apollo.backplane.com> |
swap, amd64 - increase maximum swap space to 1TB x 4
* The radix can overflow a 32 bit integer even if swblk_t fits in 32 bits. Expand the radix to 64 bits and thus allow the subr_blist code to operate up to 2 billion blocks (8TB total).
* Shortcut the common single-swap-device case. We do not have to scan the radix tree to get available space in the single-device case.
* Change maxswzone and maxbcache to longs and add TUNABLE_LONG_FETCH().
* All the TUNABLE_*_FETCH() calls and kgetenv_*() calls for integers now call kgetenv_quad().
Adjust kgetenv_quad() to accept a suffix for kilobytes, megabytes, gigabytes, and terabytes.
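A simplified userland model of the suffix handling added to kgetenv_quad(); the kernel function itself differs:

    #include <stdint.h>
    #include <stdlib.h>

    int64_t
    parse_size(const char *s)       /* illustrative stand-in */
    {
        char *end;
        int64_t v = strtoll(s, &end, 0);

        switch (*end) {             /* cascade: 't' scales by 2^40, etc. */
        case 't': case 'T': v <<= 10; /* FALLTHROUGH */
        case 'g': case 'G': v <<= 10; /* FALLTHROUGH */
        case 'm': case 'M': v <<= 10; /* FALLTHROUGH */
        case 'k': case 'K': v <<= 10; /* FALLTHROUGH */
        default: break;
        }
        return v;
    }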
|
Revision tags: v2.3.2, v2.3.1, v2.2.1, v2.2.0, v2.3.0, v2.1.1, v2.0.1 |
|
#
a46fac56 |
| 26-Jun-2005 |
Matthew Dillon <dillon@dragonflybsd.org> |
Move more scheduler-specific defines from various places into usched_bsd4.c and revamp our scheduler algorithms.
* Get rid of the multiple layers of abstraction in the nice and frequency calculations.
* Increase the scheduling frequency from 20hz to 50hz.
* Greatly reduce the number of scheduling ticks that occur before a reschedule is issued.
* Fix a bug where the scheduler was not rescheduling when estcpu drops a process into a lower priority queue.
* Implement a new batch detection algorithm. This algorithm gives forked children slightly more batchness than their parents (which is recovered quickly if the child is interactive), and propagates estcpu data from exiting children to future forked children (which handles fork/exec/wait loops such as used by make, scripts, etc).
* Change the way NICE priorities affect process execution. The NICE value is used in two ways: First, it determines the initial process priority. The estcpu will have a tendency to correct for this, so the NICE value is also used to control estcpu's decay rate.
This means that niced processes will have both an initial penalty for startup and stabilization, and an ongoing penalty if they become cpu bound.
Examples from cpu-bound loops:
    CPU PRI  NI PID %CPU    TIME COMMAND
     42 159 -20 706 20.5 0:38.88 /bin/csh /tmp/dowhile
     37 159 -15 704 17.6 0:35.09 /bin/csh /tmp/dowhile
     29 157 -10 702 15.3 0:30.41 /bin/csh /tmp/dowhile
     28 160  -5 700 13.0 0:26.73 /bin/csh /tmp/dowhile
     23 160   0 698 11.5 0:20.52 /bin/csh /tmp/dowhile
     18 160   5 696  9.2 0:16.85 /bin/csh /tmp/dowhile
     13 160  10 694  7.1 0:10.82 /bin/csh /tmp/dowhile
      3 160  20 692  1.5 0:02.14 /bin/csh /tmp/dowhile
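A toy model of the two-way nice effect described above, where nice sets the initial penalty and also scales estcpu's decay; the constants are invented, not usched_bsd4's:

    #include <stdio.h>

    int
    main(void)
    {
        for (int nice = -20; nice <= 20; nice += 10) {
            double estcpu = 100.0;                /* starting penalty */
            double decay = 0.65 + nice * 0.005;   /* invented scaling:
                                                     higher nice -> slower
                                                     decay -> lasting penalty */

            for (int tick = 0; tick < 10; tick++)
                estcpu *= decay;
            printf("nice %+3d -> estcpu after 10 ticks: %6.2f\n",
                   nice, estcpu);
        }
        return 0;
    }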
|