1.\" Copyright (c) 2001 Matthew Dillon. Terms and conditions are those of 2.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in 3.\" the source tree. 4.\" 5.Dd June 12, 2016 6.Dt TUNING 7 7.Os 8.Sh NAME 9.Nm tuning 10.Nd performance tuning under DragonFly 11.Sh SYSTEM SETUP 12Modern 13.Dx 14systems typically have just three partitions on the main drive. 15In order, a UFS 16.Pa /boot , 17.Pa swap , 18and a HAMMER 19.Pa root . 20The installer used to create separate PFSs for half a dozen directories, 21but now it just puts (almost) everything in the root. 22It will separate stuff that doesn't need to be backed up into a /build 23subdirectory and create null-mounts for things like /usr/obj, but it 24no longer creates separate PFSs for these. 25If desired, you can make /build its own mount to separate-out the 26components of the filesystem which do not need to be persistent. 27.Pp 28Generally speaking the 29.Pa /boot 30partition should be 1GB in size. This is the minimum recommended 31size, giving you room for backup kernels and alternative boot schemes. 32.Dx 33always installs debug-enabled kernels and modules and these can take 34up quite a bit of disk space (but will not take up any extra ram). 35.Pp 36In the old days we recommended that swap be sized to at least 2x main 37memory. These days swap is often used for other activities, including 38.Xr tmpfs 5 39and 40.Xr swapcache 8 . 41We recommend that swap be sized to the larger of 2x main memory or 421GB if you have a fairly small disk and up to 16GB if you have a 43moderately endowed system and a large drive. 44Or even larger if you have a SSD+HDD system in order to use swapcache. 45If you are on a minimally configured machine you may, of course, 46configure far less swap or no swap at all but we recommend at least 47some swap. 48The kernel's VM paging algorithms are tuned to perform best when there is 49at least 2x swap versus main memory. 50Configuring too little swap can lead to inefficiencies in the VM 51page scanning code as well as create issues later on if you add 52more memory to your machine. 53Swap is a good idea even if you don't think you will ever need it as it 54allows the 55machine to page out completely unused data from idle programs (like getty), 56maximizing the ram available for your activities. 57.Pp 58If you intend to use the 59.Xr swapcache 8 60facility with a SSD we recommend the SSD be configured with at 61least a 32G swap partition. 62If you are on a moderately well configured 64-bit system you can 63size swap even larger. 64Keep in mind that each 1GByte of swapcache requires around 1MByte of 65ram. 66.Pp 67Finally, on larger systems with multiple drives, if the use 68of SSD swap is not in the cards or if it is and you need higher-than-normal 69swapcache bandwidth, you can configure swap on up to four drives and 70the kernel will interleave the storage. 71The swap partitions on the drives should be approximately the same size. 72The kernel can handle arbitrary sizes but 73internal data structures scale to 4 times the largest swap partition. 74Keeping 75the swap partitions near the same size will allow the kernel to optimally 76stripe swap space across the N disks. 77Do not worry about overdoing it a 78little, swap space is the saving grace of 79.Ux 80and even if you do not normally use much swap, it can give you more time to 81recover from a runaway program before being forced to reboot. 82However, keep in mind that any sort of swap space failure can lock the 83system up. 
.Pp
Most
.Dx
systems have a single HAMMER root.
PFSs can be used to administratively separate domains for backup purposes
but tend to be a hassle otherwise, so if you don't need the administrative
separation you don't really need to use multiple HAMMER PFSs.
All the PFSs share the same allocation layer so there is no longer a need
to size each individual mount.
Instead you should review the
.Xr hammer 8
manual page and use the 'hammer viconfig' facility to adjust snapshot
retention and other parameters.
By default HAMMER keeps 60 days worth of snapshots.
Usually snapshots are not desired on PFSs such as
.Pa /usr/obj
or
.Pa /tmp
since data on these partitions cycles a lot.
.Pp
If a very large work area is desired it is often beneficial to
configure it as a separate HAMMER mount.
If it is integrated into the root mount it should at least be its own
HAMMER PFS.
We recommend naming the large work area
.Pa /build .
Similarly if a machine is going to have a large number of users
you might want to separate your
.Pa /home
out as well.
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
However, this creates a massive burden on copy-on-write filesystems like
HAMMER, particularly when scanning the filesystem.
.Dx
currently defaults to disabling atime updates on HAMMER mounts.
It can be enabled by setting the
.Va vfs.hammer.noatime
tunable to 0 in
.Xr loader.conf 5
but we recommend leaving it disabled.
The lack of atime updates can create issues with certain programs
such as when detecting whether unread mail is present, but
applications for the most part no longer depend on it.
.Sh SSD SWAP
The single most important thing you can do is have at least one
solid-state drive in your system, and configure your swap space
on that drive.
If you are using a combination of a smaller SSD and a very large HDD,
you can use
.Xr swapcache 8
to automatically cache data from your HDD.
But even if you do not, having swap space configured on your SSD will
significantly improve performance under even modest paging loads.
It is particularly useful to configure a significant amount of swap
on a workstation, 32GB or more is not uncommon, to handle bloated
leaky applications such as browsers.
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
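.Pp
For example (the variable and value shown are illustrative only), a
sysctl can be inspected and changed at run-time with
.Xr sysctl 8 ,
and made persistent across reboots with a line in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
# inspect the current value
sysctl kern.maxfiles

# change it on the running system
sysctl kern.maxfiles=65536

# the same setting as a line in /etc/sysctl.conf
kern.maxfiles=65536
.Ed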
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 1 (on) and may be set to 0 (off) or 1 (on).
Setting this parameter to 1 will cause all System V shared memory
segments to be mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).
This tells the filesystem to issue media writes as full clusters are
collected, which typically occurs when writing large sequential files.
The idea is to avoid saturating the buffer cache with dirty buffers
when it would not benefit I/O performance.
However, this may stall processes and under certain circumstances you may
wish to turn it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system wide at any given instant.
The default is usually sufficient but on machines with lots of disks
you may want to bump it up to four or five megabytes.
Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.
Do not set this value arbitrarily high!
Also, higher write queueing values may add latency to reads occurring at
the same time.
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
However,
.Dx
now auto-tunes these parameters using a number of other related
sysctls (run 'sysctl net.inet.tcp' to get a list) and they usually
no longer need to be tuned manually.
We do not recommend
increasing or decreasing the defaults if you are managing a very large
number of connections.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
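.Pp
A minimal sketch of such a pipe (the rule number, pipe number, and
bandwidth are examples only; consult
.Xr ipfw 8
and
.Xr dummynet 4
for the exact syntax supported on your system):
.Bd -literal -offset indent
# create a pipe limited to roughly 70% of a T1
ipfw pipe 1 config bw 1000Kbit/s

# push outgoing web traffic through the pipe
ipfw add 1000 pipe 1 tcp from any 80 to any out
.Ed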
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will
result in only a marginal performance improvement unless both hosts
support the window scaling extension of the TCP protocol, which is
controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC 1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is now enabled for all applications.
We do not recommend turning it off.
The extra network bandwidth is minimal and this feature will clean up
stalled and long-dead connections that might not otherwise be cleaned
up.
In the past people using dialup connections often did not want to
use this feature in order to be able to retain connections across
long disconnections, but in modern day the only default that makes
sense is for the feature to be turned on.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.
Historically speaking, this feature was designed to allow the
acknowledgement of transmitted data to be returned along with the
response.
For example, when you type over a remote shell
the acknowledgement to the character you send can be returned along with the
data representing the echo of the character.
With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.
This same concept also
applies to any interactive protocol (e.g.\& SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.
The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.
Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.
We are not certain, but the FAQs for packages such as SAMBA and SQUID
which advise turning off delayed acks may be referring to the
slow-start issue.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
This feature is now turned on by default and we recommend that it be
left on.
It will slightly reduce the maximum bandwidth of a connection but the
benefits of the feature in reducing packet backlogs at router constriction
points are enormous.
These benefits make it a whole lot easier for router algorithms to manage
QOS for multiple connections.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.
With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.
However, note that this feature
only affects data transmission (uploading / server-side).
It does not affect data reception (downloading).
.Pp
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.
This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.
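.Pp
As a rough, illustrative calculation (the numbers are examples only),
a 100 Mbit/s path with a 50 ms round trip time works out to a bandwidth
delay product of about 610KB, which is roughly the amount of data the
inflight code will allow to be outstanding on such a connection:
.Bd -literal -offset indent
(100,000,000 bits/sec / 8) x 0.050 sec = 625,000 bytes (~610KB)
.Ed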
.Pp
For production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.
Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 50, representing +5% fudge when calculating the
bwnd from the bw.
This fudge is on top of an additional fixed +2*maxseg added to bwnd.
The fudge factor is required to stabilize the algorithm
at very high speeds while the fixed 2*maxseg stabilizes the algorithm at
low speeds.
If you increase this value excessive packet buffering may occur.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.
There are three ranges: a low range, a default range, and a
high range, selectable via an IP_PORTRANGE
.Fn setsockopt
call.
Most network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000 respectively.
Bound port ranges are used for outgoing connections and it is possible
to run the system out of ports under certain circumstances.
This most commonly occurs when you are running a heavily loaded web proxy.
The port range is not an issue
when running servers which handle mainly incoming connections, such as a
normal web server, or which have a limited number of outgoing connections,
such as a mail relay.
For situations where you may run yourself out of ports we recommend
increasing
.Va net.inet.ip.portrange.last
modestly.
A value of 10000 or 20000 or 30000 may be reasonable.
You should also consider firewall effects when changing the port range.
Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect
systems to use higher ranges of ports for outgoing connections.
For this reason we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
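.Pp
For example, a busy web proxy might apply the recommendations above with
two lines in
.Xr sysctl.conf 5
(the exact values are a judgment call for your workload):
.Bd -literal -offset indent
net.inet.ip.portrange.last=20000
kern.ipc.somaxconn=1024
.Ed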
.Pp
The
.Va kern.maxvnodes
sysctl specifies how many vnodes and related file structures the kernel
will cache.
The kernel uses a very generous default for this parameter based on
available physical memory.
You generally do not want to mess with this parameter as it directly
affects how well the kernel can cache not only file structures but also
the underlying file data.
But you can lower it if kernel memory use is higher than you would like.
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically based on available physical memory but you may need to bump
it up if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
.Pp
.Va kern.nbuf
sets how many filesystem buffers the kernel should cache.
Filesystem buffers can be up to 128KB each.
UFS typically uses an 8KB blocksize while HAMMER typically uses 64KB.
The defaults usually suffice.
The cached buffers represent wired physical memory so specifying a value
that is too large can result in excessive kernel memory use, and is also
not entirely necessary since the pages backing the buffers are also
cached by the VM page cache (which does not use wired memory).
The buffer cache significantly improves the hot path for cached file
accesses.
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
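.Pp
Loader tunables such as these are simple name="value" assignments in
.Xr loader.conf 5 .
For example (the values shown are illustrative only; the automatic
defaults normally suffice):
.Bd -literal -offset indent
# /boot/loader.conf
kern.maxusers="256"
kern.maxdsiz="1073741824"	# 1GB data size hard limit, in bytes
.Ed
.Pp
After a reboot the resulting values can be checked with
.Xr sysctl 8 .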
.Pp
.Va kern.ipc.nmbclusters
and
.Va kern.ipc.nmbjclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each normal cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
Each 'j' cluster is typically 4KB, so a value of 1024 represents 4M of
kernel memory.
You can do a simple calculation to figure out how many you need but
keep in mind that tcp buffer sizing is now more dynamic than it used to
be.
.Pp
The defaults usually suffice but you may want to bump them up on
service-heavy machines.
Modern machines often need a large number of mbufs to operate services
efficiently; values of 65536, even upwards of 262144 or more, are common.
If you are running a server, it is better to be generous than to be frugal.
Remember the memory calculation though.
.Pp
Under no circumstances should you specify an arbitrarily high value for
these parameters; it could lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
.Pp
If your motherboard is AHCI-capable then we strongly recommend turning
on AHCI mode in the BIOS if it is not the default.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPUs), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system is paging to swap a lot you need to consider adding more
memory.
If your system is saturating the disk you typically see high CPU idle
times and total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
.Pp
Finally, you might run out of network suds.
Optimize the network path as much as possible.
If you are operating a machine as a router you may need to set up a
.Xr pf 4
firewall (also see
.Xr firewall 7 ) .
.Dx
has a very good fair-share queueing algorithm for QOS in
.Xr pf 4 .
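.Pp
A few quick, illustrative commands for watching these bottlenecks
(the flags shown are typical; see the respective manual pages for the
exact options on your system):
.Bd -literal -offset indent
systat -vmstat 1	# CPU, paging and per-disk activity, 1s refresh
iostat 1		# per-disk throughput
netstat -m		# mbuf and network cluster usage
.Ed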
.Sh SOURCE OF KERNEL MEMORY USAGE
The primary sources of kernel memory usage are:
.Pp
.Bl -tag
.It Va kern.maxvnodes
The maximum number of cached vnodes in the system.
These can eat quite a bit of kernel memory, primarily due to auxiliary
structures tracked by the HAMMER filesystem.
It is relatively easy to configure a smaller value, but we do not
recommend reducing this parameter below 100000.
Smaller values directly impact the number of discrete files the
kernel can cache data for at once.
.It Va kern.ipc.nmbclusters
.It Va kern.ipc.nmbjclusters
Calculate approximately 2KB per normal cluster and 4KB per jumbo
cluster.
Do not make these values too low or you risk deadlocking the network
stack.
.It Va kern.nbuf
The number of filesystem buffers managed by the kernel.
The kernel wires the underlying cached VM pages, typically 8KB (UFS) or
64KB (HAMMER) per buffer.
.It swap/swapcache
Swap memory requires approximately 1MB of physical ram for each 1GB
of swap space.
When swapcache is used, additional memory may be required to keep
VM objects around longer (only really reducible by reducing the
value of
.Va kern.maxvnodes ,
which you can do post-boot if you desire).
.It tmpfs
Tmpfs is very useful but keep in mind that while the file data itself
is backed by swap, the meta-data (the directory topology) requires
wired kernel memory.
.It mmu page tables
Even though the underlying data pages themselves can be paged to swap,
the page tables are usually wired into memory.
This can create problems when a large number of processes are mmap()ing
very large files.
Sometimes turning on
.Va machdep.pmap_mmu_optimize
suffices to reduce overhead.
Page table kernel memory use can be observed by using 'vmstat -z'.
.It Va kern.ipc.shm_use_phys
It is sometimes necessary to force shared memory to use physical memory
when running a large database which uses shared memory to implement its
own data caching.
The use of sysv shared memory in this regard allows the database to
distinguish between data which it knows it can access instantly (i.e.\&
without even having to page in from swap) versus data which it might
require an I/O to fetch.
.Pp
If you use this feature be very careful with regards to the database's
shared memory configuration as you will be wiring the memory.
.El
.Sh SEE ALSO
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr dm 4 ,
.Xr dummynet 4 ,
.Xr firewall 7 ,
.Xr fsck 8 ,
.Xr hier 7 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr login.conf 5 ,
.Xr mount 8 ,
.Xr nata 4 ,
.Xr netstat 1 ,
.Xr newfs 8 ,
.Xr pf 4 ,
.Xr pf.conf 5 ,
.Xr rc.conf 5 ,
.Xr route 8 ,
.Xr systat 1 ,
.Xr sysctl 8 ,
.Xr sysctl.conf 5 ,
.Xr tunefs 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.