1.\" Copyright (c) 2001 Matthew Dillon.  Terms and conditions are those of
2.\" the BSD Copyright as specified in the file "/usr/src/COPYRIGHT" in
3.\" the source tree.
4.\"
5.Dd June 12, 2016
6.Dt TUNING 7
7.Os
8.Sh NAME
9.Nm tuning
10.Nd performance tuning under DragonFly
11.Sh SYSTEM SETUP
12Modern
13.Dx
14systems typically have just three partitions on the main drive.
15In order, a UFS
16.Pa /boot ,
17.Pa swap ,
18and a HAMMER
19.Pa root .
20The installer used to create separate PFSs for half a dozen directories,
21but now it just puts (almost) everything in the root.
22It will separate stuff that doesn't need to be backed up into a /build
23subdirectory and create null-mounts for things like /usr/obj, but it
24no longer creates separate PFSs for these.
25If desired, you can make /build its own mount to separate-out the
26components of the filesystem which do not need to be persistent.
27.Pp
28Generally speaking the
29.Pa /boot
30partition should be 1GB in size.  This is the minimum recommended
31size, giving you room for backup kernels and alternative boot schemes.
32.Dx
33always installs debug-enabled kernels and modules and these can take
34up quite a bit of disk space (but will not take up any extra ram).
35.Pp
36In the old days we recommended that swap be sized to at least 2x main
37memory.  These days swap is often used for other activities, including
38.Xr tmpfs 5
39and
40.Xr swapcache 8 .
41We recommend that swap be sized to the larger of 2x main memory or
421GB if you have a fairly small disk and up to 16GB if you have a
43moderately endowed system and a large drive.
44Or even larger if you have a SSD+HDD system in order to use swapcache.
45If you are on a minimally configured machine you may, of course,
46configure far less swap or no swap at all but we recommend at least
47some swap.
48The kernel's VM paging algorithms are tuned to perform best when there is
49at least 2x swap versus main memory.
50Configuring too little swap can lead to inefficiencies in the VM
51page scanning code as well as create issues later on if you add
52more memory to your machine.
53Swap is a good idea even if you don't think you will ever need it as it
54allows the
55machine to page out completely unused data from idle programs (like getty),
56maximizing the ram available for your activities.
57.Pp
58If you intend to use the
59.Xr swapcache 8
60facility with a SSD we recommend the SSD be configured with at
61least a 32G swap partition.
62If you are on a moderately well configured 64-bit system you can
63size swap even larger.
64Keep in mind that each 1GByte of swapcache requires around 1MByte of
65ram.
66.Pp
67Finally, on larger systems with multiple drives, if the use
68of SSD swap is not in the cards or if it is and you need higher-than-normal
69swapcache bandwidth, you can configure swap on up to four drives and
70the kernel will interleave the storage.
71The swap partitions on the drives should be approximately the same size.
72The kernel can handle arbitrary sizes but
73internal data structures scale to 4 times the largest swap partition.
74Keeping
75the swap partitions near the same size will allow the kernel to optimally
76stripe swap space across the N disks.
77Do not worry about overdoing it a
78little, swap space is the saving grace of
79.Ux
80and even if you do not normally use much swap, it can give you more time to
81recover from a runaway program before being forced to reboot.
82However, keep in mind that any sort of swap space failure can lock the
83system up.
84Most machines are setup with only one or two swap partitions.
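.Pp
As a minimal sketch, interleaved swap across two drives might appear in
.Xr fstab 5
as follows (the device names are examples only; substitute your own):
.Bd -literal -offset indent
# Device        Mountpoint      FStype  Options Dump    Pass#
/dev/da0s1b     none            swap    sw      0       0
/dev/da1s1b     none            swap    sw      0       0
.Ed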
.Pp
Most
.Dx
systems have a single HAMMER root.
PFSs can be used to administratively separate domains for backup purposes
but tend to be a hassle otherwise so if you don't need the administrative
separation you don't really need to use multiple HAMMER PFSs.
All the PFSs share the same allocation layer so there is no longer a need
to size each individual mount.
Instead you should review the
.Xr hammer 8
manual page and use the 'hammer viconfig' facility to adjust snapshot
retention and other parameters.
By default
HAMMER keeps 60 days worth of snapshots.
Usually snapshots are not desired on PFSs such as
.Pa /usr/obj
or
.Pa /tmp
since data on these partitions cycles a lot.
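.Pp
For example, running 'hammer viconfig /' lets you edit the configuration
for the root PFS; the entries look roughly like the following sketch (see
.Xr hammer 8
for the authoritative format and defaults):
.Bd -literal -offset indent
snapshots 1d 60d    # daily snapshots, retained for 60 days
prune     1d 5m
rebalance 1d 5m
reblock   1d 5m
recopy    30d 10m
.Ed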
.Pp
If a very large work area is desired it is often beneficial to
configure it as a separate HAMMER mount.  If it is integrated into
the root mount it should at least be its own HAMMER PFS.
We recommend naming the large work area
.Pa /build .
Similarly if a machine is going to have a large number of users
you might want to separate your
.Pa /home
out as well.
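.Pp
A sketch of such a layout in
.Xr fstab 5 ,
with example device and directory names, might mount
.Pa /build
as its own filesystem and null-mount the non-persistent pieces back into
the normal hierarchy:
.Bd -literal -offset indent
/dev/da0s1d         /build          hammer  rw      1 1
/build/usr.obj      /usr/obj        null    rw      0 0
/build/var.crash    /var/crash      null    rw      0 0
.Ed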
.Pp
A number of run-time
.Xr mount 8
options exist that can help you tune the system.
The most obvious and most dangerous one is
.Cm async .
Do not ever use it; it is far too dangerous.
A less dangerous and more
useful
.Xr mount 8
option is called
.Cm noatime .
.Ux
filesystems normally update the last-accessed time of a file or
directory whenever it is accessed.
However, this creates a massive burden on copy-on-write filesystems like
HAMMER, particularly when scanning the filesystem.
.Dx
currently defaults to disabling atime updates on HAMMER mounts.
It can be enabled by setting the
.Va vfs.hammer.noatime
tunable to 0 in
.Xr loader.conf 5
but we recommend leaving it disabled.
The lack of atime updates can create issues with certain programs,
for example when detecting whether unread mail is present, but
applications for the most part no longer depend on it.
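.Pp
If you do want traditional atime semantics on HAMMER, the tunable can be
set in
.Xr loader.conf 5
as sketched below; most systems should leave it at the default of 1:
.Bd -literal -offset indent
# /boot/loader.conf
vfs.hammer.noatime="0"    # re-enable atime updates on HAMMER mounts
.Ed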
.Sh SSD SWAP
The single most important thing you can do is have at least one
solid-state drive in your system, and configure your swap space
on that drive.
If you are using a combination of a smaller SSD and a very large HDD,
you can use
.Xr swapcache 8
to automatically cache data from your HDD.
But even if you do not, having swap space configured on your SSD will
significantly improve performance under even modest paging loads.
It is particularly useful to configure a significant amount of swap
on a workstation, 32GB or more is not uncommon, to handle bloated
leaky applications such as browsers.
.Sh SYSCTL TUNING
.Xr sysctl 8
variables permit system behavior to be monitored and controlled at
run-time.
Some sysctls simply report on the behavior of the system; others allow
the system behavior to be modified;
some may be set at boot time using
.Xr rc.conf 5 ,
but most will be set via
.Xr sysctl.conf 5 .
There are several hundred sysctls in the system, including many that appear
to be candidates for tuning but actually are not.
In this document we will only cover the ones that have the greatest effect
on the system.
.Pp
The
.Va kern.ipc.shm_use_phys
sysctl defaults to 1 (on) and may be set to 0 (off) or 1 (on).
Setting
this parameter to 1 will cause all System V shared memory segments to be
mapped to unpageable physical RAM.
This feature only has an effect if you
are either (A) mapping small amounts of shared memory across many (hundreds)
of processes, or (B) mapping large amounts of shared memory across any
number of processes.
This feature allows the kernel to remove a great deal
of internal memory management page-tracking overhead at the cost of wiring
the shared memory into core, making it unswappable.
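.Pp
For example, if you prefer SysV shared memory to remain pageable, the
setting can be changed in
.Xr sysctl.conf 5 ;
this is only a sketch, and the default of 1 is right for most systems:
.Bd -literal -offset indent
# /etc/sysctl.conf
kern.ipc.shm_use_phys=0
.Ed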
.Pp
The
.Va vfs.write_behind
sysctl defaults to 1 (on).  This tells the filesystem to issue media
writes as full clusters are collected, which typically occurs when writing
large sequential files.  The idea is to avoid saturating the buffer
cache with dirty buffers when it would not benefit I/O performance.  However,
this may stall processes and under certain circumstances you may wish to turn
it off.
.Pp
The
.Va vfs.hirunningspace
sysctl determines how much outstanding write I/O may be queued to
disk controllers system wide at any given instance.  The default is
usually sufficient but on machines with lots of disks you may want to bump
it up to four or five megabytes.  Note that setting too high a value
(exceeding the buffer cache's write threshold) can lead to extremely
bad clustering performance.  Do not set this value arbitrarily high!  Also,
higher write queueing values may add latency to reads occurring at the same
time.
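.Pp
Such adjustments normally go in
.Xr sysctl.conf 5 .
The following sketch assumes the value is expressed in bytes; the numbers
are illustrative only:
.Bd -literal -offset indent
# /etc/sysctl.conf
vfs.write_behind=1          # keep write-behind enabled (the default)
vfs.hirunningspace=5242880  # allow roughly 5MB of queued write I/O
.Ed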
.Pp
There are various other buffer-cache and VM page cache related sysctls.
We do not recommend modifying these values.
As of
.Fx 4.3 ,
the VM system does an extremely good job tuning itself.
.Pp
The
.Va net.inet.tcp.sendspace
and
.Va net.inet.tcp.recvspace
sysctls are of particular interest if you are running network intensive
applications.
They control the amount of send and receive buffer space
allowed for any given TCP connection.
However,
.Dx
now auto-tunes these parameters using a number of other related
sysctls (run 'sysctl net.inet.tcp' to get a list) and they usually
no longer need to be tuned manually.
We do not recommend
increasing or decreasing the defaults if you are managing a very large
number of connections.
Note that the routing table (see
.Xr route 8 )
can be used to introduce route-specific send and receive buffer size
defaults.
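.Pp
As a sketch, route-specific buffer defaults can be set with
.Xr route 8 ;
the network and sizes below are examples only:
.Bd -literal -offset indent
route change 192.0.2.0/24 -sendpipe 131072 -recvpipe 131072
.Ed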
.Pp
As an additional management tool you can use pipes in your
firewall rules (see
.Xr ipfw 8 )
to limit the bandwidth going to or from particular IP blocks or ports.
For example, if you have a T1 you might want to limit your web traffic
to 70% of the T1's bandwidth in order to leave the remainder available
for mail and interactive use.
Normally a heavily loaded web server
will not introduce significant latencies into other services even if
the network link is maxed out, but enforcing a limit can smooth things
out and lead to longer term stability.
Many people also enforce artificial
bandwidth limitations in order to ensure that they are not charged for
using too much bandwidth.
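.Pp
A minimal
.Xr ipfw 8
sketch of this kind of shaping, with the rule number and bandwidth cap
chosen purely as examples:
.Bd -literal -offset indent
# limit outbound web traffic to roughly 70% of a T1 (~1.1 Mbit/s)
ipfw pipe 1 config bw 1100Kbit/s
ipfw add 1000 pipe 1 tcp from any 80 to any out
.Ed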
.Pp
Setting the send or receive TCP buffer to values larger than 65535 will result
in a marginal performance improvement unless both hosts support the window
scaling extension of the TCP protocol, which is controlled by the
.Va net.inet.tcp.rfc1323
sysctl.
These extensions should be enabled and the TCP buffer size should be set
to a value larger than 65536 in order to obtain good performance from
certain types of network links; specifically, gigabit WAN links and
high-latency satellite links.
RFC 1323 support is enabled by default.
.Pp
The
.Va net.inet.tcp.always_keepalive
sysctl determines whether or not the TCP implementation should attempt
to detect dead TCP connections by intermittently delivering
.Dq keepalives
on the connection.
By default, this is now enabled for all applications.
We do not recommend turning it off.
The extra network bandwidth is minimal and this feature will clean up
stalled and long-dead connections that might not otherwise be cleaned
up.
In the past people using dialup connections often did not want to
use this feature in order to be able to retain connections across
long disconnections, but these days the only default that makes
sense is for the feature to be turned on.
.Pp
The
.Va net.inet.tcp.delayed_ack
TCP feature is largely misunderstood.  Historically speaking this feature
was designed to allow the acknowledgement of transmitted data to be returned
along with the response.  For example, when you type over a remote shell
the acknowledgement of the character you send can be returned along with the
data representing the echo of the character.  With delayed acks turned off
the acknowledgement may be sent in its own packet before the remote service
has a chance to echo the data it just received.  This same concept also
applies to any interactive protocol (e.g. SMTP, WWW, POP3) and can cut the
number of tiny packets flowing across the network in half.  The
.Dx
delayed-ack implementation also follows the TCP protocol rule that
at least every other packet be acknowledged even if the standard 100ms
timeout has not yet passed.  Normally the worst a delayed ack can do is
slightly delay the teardown of a connection, or slightly delay the ramp-up
of a slow-start TCP connection.  While we are not certain, we believe that
the several FAQs related to packages such as SAMBA and SQUID which advise
turning off delayed acks are referring to the slow-start issue.
.Pp
The
.Va net.inet.tcp.inflight_enable
sysctl turns on bandwidth delay product limiting for all TCP connections.
This feature is now turned on by default and we recommend that it be
left on.
It will slightly reduce the maximum bandwidth of a connection but the
benefits of the feature in reducing packet backlogs at router constriction
points are enormous.
These benefits make it a whole lot easier for router algorithms to manage
QOS for multiple connections.
The limiting feature reduces the amount of data built up in intermediate
router and switch packet queues as well as reduces the amount of data built
up in the local host's interface queue.  With fewer packets queued up,
interactive connections, especially over slow modems, will also be able
to operate with lower round trip times.  However, note that this feature
only affects data transmission (uploading / server-side).  It does not
affect data reception (downloading).
.Pp
The system will attempt to calculate the bandwidth delay product for each
connection and limit the amount of data queued to the network to just the
amount required to maintain optimum throughput.  This feature is useful
if you are serving data over modems, GigE, or high speed WAN links (or
any other link with a high bandwidth*delay product), especially if you are
also using window scaling or have configured a large send window.
.Pp
For production use setting
.Va net.inet.tcp.inflight_min
to at least 6144 may be beneficial.  Note, however, that setting high
minimums may effectively disable bandwidth limiting depending on the link.
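.Pp
For example, in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
# /etc/sysctl.conf
net.inet.tcp.inflight_enable=1  # already the default
net.inet.tcp.inflight_min=6144  # floor for the computed inflight window
.Ed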
.Pp
Adjusting
.Va net.inet.tcp.inflight_stab
is not recommended.
This parameter defaults to 50, representing +5% fudge when calculating the
bwnd from the bw.  This fudge is on top of an additional fixed +2*maxseg
added to bwnd.  The fudge factor is required to stabilize the algorithm
at very high speeds while the fixed 2*maxseg stabilizes the algorithm at
low speeds.  If you increase this value excessive packet buffering may occur.
.Pp
The
.Va net.inet.ip.portrange.*
sysctls control the port number ranges automatically bound to TCP and UDP
sockets.  There are three ranges:  A low range, a default range, and a
high range, selectable via an IP_PORTRANGE
.Fn setsockopt
call.
Most network programs use the default range which is controlled by
.Va net.inet.ip.portrange.first
and
.Va net.inet.ip.portrange.last ,
which default to 1024 and 5000 respectively.  Bound port ranges are
used for outgoing connections and it is possible to run the system out
of ports under certain circumstances.  This most commonly occurs when you are
running a heavily loaded web proxy.  The port range is not an issue
when running servers which handle mainly incoming connections, such as a
normal web server, or which have a limited number of outgoing connections,
such as a mail relay.  For situations where you may run yourself out of
ports we recommend increasing
.Va net.inet.ip.portrange.last
modestly.  A value of 10000 or 20000 or 30000 may be reasonable.  You should
also consider firewall effects when changing the port range.  Some firewalls
may block large ranges of ports (usually low-numbered ports) and expect systems
to use higher ranges of ports for outgoing connections.  For this reason
we do not recommend that
.Va net.inet.ip.portrange.first
be lowered.
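.Pp
For example, to modestly expand the default range in
.Xr sysctl.conf 5
(the upper bound shown is only one reasonable choice):
.Bd -literal -offset indent
# /etc/sysctl.conf
net.inet.ip.portrange.last=20000
.Ed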
.Pp
The
.Va kern.ipc.somaxconn
sysctl limits the size of the listen queue for accepting new TCP connections.
The default value of 128 is typically too low for robust handling of new
connections in a heavily loaded web server environment.
For such environments,
we recommend increasing this value to 1024 or higher.
The service daemon
may itself limit the listen queue size (e.g.\&
.Xr sendmail 8 ,
apache) but will
often have a directive in its configuration file to adjust the queue size up.
Larger listen queues also do a better job of fending off denial of service
attacks.
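.Pp
For a busy web server this might be set in
.Xr sysctl.conf 5
as follows; 1024 is a reasonable starting point, not a hard rule:
.Bd -literal -offset indent
# /etc/sysctl.conf
kern.ipc.somaxconn=1024
.Ed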
.Pp
The
.Va kern.maxvnodes
sysctl specifies how many vnodes and related file structures the kernel will
cache.
The kernel uses a very generous default for this parameter based on
available physical memory.
You generally do not want to mess with this parameter as it directly
affects how well the kernel can cache not only file structures but also
the underlying file data.
But you can lower it if kernel memory use is higher than you would like.
.Pp
The
.Va kern.maxfiles
sysctl determines how many open files the system supports.
The default is
typically based on available physical memory but you may need to bump
it up if you are running databases or large descriptor-heavy daemons.
The read-only
.Va kern.openfiles
sysctl may be interrogated to determine the current number of open files
on the system.
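.Pp
A quick sketch of checking current usage and raising the limit; the value
shown is arbitrary and can also be made persistent in
.Xr sysctl.conf 5 :
.Bd -literal -offset indent
sysctl kern.openfiles kern.maxfiles   # inspect usage and current limit
sysctl kern.maxfiles=262144           # raise the limit if needed
.Ed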
.Pp
The
.Va vm.swap_idle_enabled
sysctl is useful in large multi-user systems where you have lots of users
entering and leaving the system and lots of idle processes.
Such systems
tend to generate a great deal of continuous pressure on free memory reserves.
Turning this feature on and adjusting the swapout hysteresis (in idle
seconds) via
.Va vm.swap_idle_threshold1
and
.Va vm.swap_idle_threshold2
allows you to depress the priority of pages associated with idle processes
more quickly than the normal pageout algorithm.
This gives a helping hand
to the pageout daemon.
Do not turn this option on unless you need it,
because the tradeoff you are making is to essentially pre-page memory sooner
rather than later, eating more swap and disk bandwidth.
In a small system
this option will have a detrimental effect but in a large system that is
already doing moderate paging this option allows the VM system to stage
whole processes into and out of memory more easily.
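.Pp
A sketch of enabling this on a large multi-user system via
.Xr sysctl.conf 5 ;
the thresholds are in seconds of idle time and are examples only:
.Bd -literal -offset indent
# /etc/sysctl.conf
vm.swap_idle_enabled=1
vm.swap_idle_threshold1=15   # begin depressing priority after 15s idle
vm.swap_idle_threshold2=30   # swap out more aggressively after 30s idle
.Ed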
.Sh LOADER TUNABLES
Some aspects of the system behavior may not be tunable at runtime because
memory allocations they perform must occur early in the boot process.
To change loader tunables, you must set their values in
.Xr loader.conf 5
and reboot the system.
.Pp
.Va kern.maxusers
controls the scaling of a number of static system tables, including defaults
for the maximum number of open files, sizing of network memory resources, etc.
On
.Dx ,
.Va kern.maxusers
is automatically sized at boot based on the amount of memory available in
the system, and may be determined at run-time by inspecting the value of the
read-only
.Va kern.maxusers
sysctl.
Some sites will require larger or smaller values of
.Va kern.maxusers
and may set it as a loader tunable; values of 64, 128, and 256 are not
uncommon.
We do not recommend going above 256 unless you need a huge number
of file descriptors; many of the tunable values set to their defaults by
.Va kern.maxusers
may be individually overridden at boot-time or run-time as described
elsewhere in this document.
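.Pp
For example, in
.Xr loader.conf 5
(choose a value appropriate to your workload):
.Bd -literal -offset indent
# /boot/loader.conf
kern.maxusers="256"
.Ed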
.Pp
.Va kern.nbuf
sets how many filesystem buffers the kernel should cache.
Filesystem buffers can be up to 128KB each.  UFS typically uses an 8KB
blocksize while HAMMER typically uses 64KB.
The defaults usually suffice.
The cached buffers represent wired physical memory so specifying a value
that is too large can result in excessive kernel memory use, and is also
not entirely necessary since the pages backing the buffers are also
cached by the VM page cache (which does not use wired memory).
The buffer cache significantly improves the hot path for cached file
accesses.
.Pp
The
.Va kern.dfldsiz
and
.Va kern.dflssiz
tunables set the default soft limits for process data and stack size
respectively.
Processes may increase these up to the hard limits by calling
.Xr setrlimit 2 .
The
.Va kern.maxdsiz ,
.Va kern.maxssiz ,
and
.Va kern.maxtsiz
tunables set the hard limits for process data, stack, and text size
respectively; processes may not exceed these limits.
The
.Va kern.sgrowsiz
tunable controls how much the stack segment will grow when a process
needs to allocate more stack.
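.Pp
For example, to raise the default and maximum data segment sizes in
.Xr loader.conf 5 ;
the values are in bytes and purely illustrative:
.Bd -literal -offset indent
# /boot/loader.conf
kern.dfldsiz="268435456"    # 256MB default (soft) data size limit
kern.maxdsiz="1073741824"   # 1GB hard data size limit
.Ed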
.Pp
.Va kern.ipc.nmbclusters
and
.Va kern.ipc.nmbjclusters
may be adjusted to increase the number of network mbufs the system is
willing to allocate.
Each normal cluster represents approximately 2K of memory,
so a value of 1024 represents 2M of kernel memory reserved for network
buffers.
Each 'j' cluster is typically 4KB, so a value of 1024 represents 4M of
kernel memory.
You can do a simple calculation to figure out how many you need but
keep in mind that tcp buffer sizing is now more dynamic than it used to
be.
.Pp
The defaults usually suffice but you may want to bump them up on service-heavy
machines.
Modern machines often need a large number of mbufs to operate services
efficiently; values of 65536, even upwards of 262144 or more, are common.
If you are running a server, it is better to be generous than to be frugal.
Remember the memory calculation though.
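.Pp
For example, in
.Xr loader.conf 5 ;
the sizes below are examples for a busy server, so do the memory
arithmetic for your own machine first:
.Bd -literal -offset indent
# /boot/loader.conf
kern.ipc.nmbclusters="65536"    # ~128MB of 2K clusters
kern.ipc.nmbjclusters="16384"   # ~64MB of 4K jumbo clusters
.Ed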
.Pp
Under no circumstances
should you specify an arbitrarily high value for these parameters, as it
could lead to a boot-time crash.
The
.Fl m
option to
.Xr netstat 1
may be used to observe network cluster use.
.Sh KERNEL CONFIG TUNING
There are a number of kernel options that you may have to fiddle with in
a large-scale system.
In order to change these options you need to be
able to compile a new kernel from source.
The
.Xr config 8
manual page and the handbook are good starting points for learning how to
do this.
Generally the first thing you do when creating your own custom
kernel is to strip out all the drivers and services you do not use.
Removing things like
.Dv INET6
and drivers you do not have will reduce the size of your kernel, sometimes
by a megabyte or more, leaving more memory available for applications.
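.Pp
As a sketch, building and installing a custom kernel configuration,
here named MYKERNEL purely as an example, generally looks like:
.Bd -literal -offset indent
cd /usr/src
make buildkernel KERNCONF=MYKERNEL
make installkernel KERNCONF=MYKERNEL
.Ed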
.Pp
If your motherboard is AHCI-capable then we strongly recommend turning
on AHCI mode in the BIOS if it is not the default.
.Sh CPU, MEMORY, DISK, NETWORK
The type of tuning you do depends heavily on where your system begins to
bottleneck as load increases.
If your system runs out of CPU (idle times
are perpetually 0%) then you need to consider upgrading the CPU or moving to
an SMP motherboard (multiple CPU's), or perhaps you need to revisit the
programs that are causing the load and try to optimize them.
If your system
is paging to swap a lot you need to consider adding more memory.
If your
system is saturating the disk you typically see high CPU idle times and
total disk saturation.
.Xr systat 1
can be used to monitor this.
There are many solutions to saturated disks:
increasing memory for caching, mirroring disks, distributing operations across
several machines, and so forth.
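.Pp
For example, a once-a-second overview of CPU, paging, and per-disk
activity can be obtained with:
.Bd -literal -offset indent
systat -vmstat 1
.Ed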
.Pp
Finally, you might run out of network suds.
Optimize the network path
as much as possible.
If you are operating a machine as a router you may need to
set up a
.Xr pf 4
firewall (also see
.Xr firewall 7 ) .
.Dx
has a very good fair-share queueing algorithm for QOS in
.Xr pf 4 .
.Sh SOURCE OF KERNEL MEMORY USAGE
The primary sources of kernel memory usage are:
.Pp
.Bl -tag
.It Va kern.maxvnodes
The maximum number of cached vnodes in the system.
These can eat quite a bit of kernel memory, primarily due to auxiliary
structures tracked by the HAMMER filesystem.
It is relatively easy to configure a smaller value, but we do not
recommend reducing this parameter below 100000.
Smaller values directly impact the number of discrete files the
kernel can cache data for at once.
.It Va kern.ipc.nmbclusters
.It Va kern.ipc.nmbjclusters
Calculate approximately 2KB per normal cluster and 4KB per jumbo
cluster.
Do not make these values too low or you risk deadlocking the network
stack.
.It Va kern.nbuf
The number of filesystem buffers managed by the kernel.
The kernel wires the underlying cached VM pages, typically 8KB (UFS) or
64KB (HAMMER) per buffer.
.It swap/swapcache
Swap memory requires approximately 1MB of physical ram for each 1GB
of swap space.
When swapcache is used, additional memory may be required to keep
VM objects around longer (only really reducible by reducing the
value of
.Va kern.maxvnodes
which you can do post-boot if you desire).
.It tmpfs
Tmpfs is very useful but keep in mind that while the file data itself
is backed by swap, the meta-data (the directory topology) requires
wired kernel memory.
.It mmu page tables
Even though the underlying data pages themselves can be paged to swap,
the page tables are usually wired into memory.
This can create problems when a large number of processes are mmap()ing
very large files.
Sometimes turning on
.Va machdep.pmap_mmu_optimize
suffices to reduce overhead.
Page table kernel memory use can be observed by using 'vmstat -z'
(see the example following this list).
.It Va kern.ipc.shm_use_phys
It is sometimes necessary to force shared memory to use physical memory
when running a large database which uses shared memory to implement its
own data caching.
The use of sysv shared memory in this regard allows the database to
distinguish between data which it knows it can access instantly (i.e.
without even having to page-in from swap) versus data which it might
require I/O to fetch.
.Pp
If you use this feature be very careful with regards to the database's
shared memory configuration as you will be wiring the memory.
.El
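.Pp
For example, the following commands, shown only as a sketch, give a quick
view of where kernel memory is going:
.Bd -literal -offset indent
vmstat -z               # per-zone kernel memory use (page tables, etc.)
vmstat -m               # kernel malloc usage by type
sysctl kern.maxvnodes   # the current vnode limit
.Ed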
.Sh SEE ALSO
.Xr boot 8 ,
.Xr ccdconfig 8 ,
.Xr config 8 ,
.Xr disklabel 8 ,
.Xr dm 4 ,
.Xr dummynet 4 ,
.Xr firewall 7 ,
.Xr fsck 8 ,
.Xr hier 7 ,
.Xr ifconfig 8 ,
.Xr ipfw 8 ,
.Xr loader 8 ,
.Xr login.conf 5 ,
.Xr mount 8 ,
.Xr nata 4 ,
.Xr netstat 1 ,
.Xr newfs 8 ,
.Xr pf 4 ,
.Xr pf.conf 5 ,
.Xr rc.conf 5 ,
.Xr route 8 ,
.Xr sysctl 8 ,
.Xr sysctl.conf 5 ,
.Xr systat 1 ,
.Xr tunefs 8
.Sh HISTORY
The
.Nm
manual page was originally written by
.An Matthew Dillon
and first appeared
in
.Fx 4.3 ,
May 2001.