.\"
.\" swapcache - Cache clean filesystem data & meta-data on SSD-based swap
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.Dd February 7, 2010
.Dt SWAPCACHE 8
.Os
.Sh NAME
.Nm swapcache
.Nd a mechanism to use fast swap to cache filesystem data and meta-data
.Sh SYNOPSIS
.Cd sysctl vm.swapcache.accrate=100000
.Cd sysctl vm.swapcache.maxfilesize=0
.Cd sysctl vm.swapcache.maxburst=2000000000
.Cd sysctl vm.swapcache.curburst=4000000000
.Cd sysctl vm.swapcache.minburst=10000000
.Cd sysctl vm.swapcache.read_enable=0
.Cd sysctl vm.swapcache.meta_enable=0
.Cd sysctl vm.swapcache.data_enable=0
.Cd sysctl vm.swapcache.use_chflags=1
.Cd sysctl vm.swapcache.maxlaunder=256
.Cd sysctl vm.swapcache.hysteresis=(vm.stats.vm.v_inactive_target/2)
.Sh DESCRIPTION
.Nm
is a system capability which allows a solid state disk (SSD) in a swap
space configuration to be used to cache clean filesystem data and meta-data
in addition to its normal function of backing anonymous memory.
.Pp
Sysctls are used to manage operational parameters and can be adjusted at
any time.
Typically a large initial burst is desired after system boot,
controlled by the initial
.Va vm.swapcache.curburst
parameter.
This parameter is reduced as data is written to swap by the swapcache
and increased at a rate specified by
.Va vm.swapcache.accrate .
Once this parameter reaches zero, write activity ceases until it has
recovered sufficiently for writing to resume.
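.Pp
The current settings and counters can be inspected at any time with
.Xr sysctl 8 ,
for example:
.Pp
.Dl sysctl vm.swapcache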
47ed7b872cSMatthew Dillon.Pp
4867bda820SThomas Nikolajsen.Va vm.swapcache.meta_enable
4967bda820SThomas Nikolajsenenables the writing of filesystem meta-data to the swapcache.
5067bda820SThomas NikolajsenFilesystem
51ed7b872cSMatthew Dillonmetadata is any data which the filesystem accesses via the disk device
5267bda820SThomas Nikolajsenusing buffercache.
5367bda820SThomas NikolajsenMeta-data is cached globally regardless of file or directory flags.
54ed7b872cSMatthew Dillon.Pp
5567bda820SThomas Nikolajsen.Va vm.swapcache.data_enable
5626353f58SMatthew Dillonenables the writing of clean filesystem file-data to the swapcache.
5726353f58SMatthew DillonFilesystem filedata is any data which the filesystem accesses via a
5867bda820SThomas Nikolajsenregular file.
5967bda820SThomas NikolajsenIn technical terms, when the buffer cache is used to access
6026353f58SMatthew Dillona regular file through its vnode.
6167bda820SThomas NikolajsenPlease do not blindly turn on this option, see the
6267bda820SThomas Nikolajsen.Sx PERFORMANCE TUNING
6326353f58SMatthew Dillonsection for more information.
64ed7b872cSMatthew Dillon.Pp
6567bda820SThomas Nikolajsen.Va vm.swapcache.use_chflags
66ed7b872cSMatthew Dillonenables the use of the
6767bda820SThomas Nikolajsen.Va cache
68ed7b872cSMatthew Dillonand
6967bda820SThomas Nikolajsen.Va noscache
70ed7b872cSMatthew Dillon.Xr chflags 1
71ed7b872cSMatthew Dillonflags to control which files will be data-cached.
7267bda820SThomas NikolajsenIf this sysctl is disabled and
7367bda820SThomas Nikolajsen.Va data_enable
7467bda820SThomas Nikolajsenis enabled, the system will ignore file flags and attempt to
7567bda820SThomas Nikolajsenswapcache all regular files.
76ed7b872cSMatthew Dillon.Pp
7767bda820SThomas Nikolajsen.Va vm.swapcache.read_enable
78ed7b872cSMatthew Dillonenables reading from the swapcache and should be set to 1 for normal
79ed7b872cSMatthew Dillonoperation.
80ed7b872cSMatthew Dillon.Pp
8167bda820SThomas Nikolajsen.Va vm.swapcache.maxfilesize
82ed7b872cSMatthew Dilloncontrols which files are to be cached based on their size.
83ed7b872cSMatthew DillonIf set to non-zero only files smaller than the specified size
8467bda820SThomas Nikolajsenwill be cached.
8567bda820SThomas NikolajsenLarger files will not be cached.
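.Pp
For example, to restrict data caching to files smaller than roughly 1MB
(an illustrative value, not a recommendation):
.Pp
.Dl sysctl vm.swapcache.maxfilesize=1000000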
.Pp
.Va vm.swapcache.maxlaunder
controls the maximum number of clean VM pages which will be added to
the swap cache and written out to swap on each poll.
Swapcache polls ten times a second.
.Pp
.Va vm.swapcache.hysteresis
controls how many pages must be added to the inactive page
queue before swapcache continues its scan.
Once it decides to scan, it continues subject to the above limitations
until it reaches the end of the inactive page queue.
This parameter is designed to make swapcache generate bulkier bursts
to swap, which helps SSDs reduce write amplification effects.
.Sh PERFORMANCE TUNING
Best operation is achieved when the active data set fits within the
swapcache.
.Pp
.Bl -tag -width 4n -compact
.It Va vm.swapcache.accrate
This specifies the burst accumulation rate in bytes per second and
ultimately controls the write bandwidth to swap averaged over a long
period of time.
This parameter must be carefully chosen to manage the write endurance of
the SSD in order to avoid wearing it out too quickly.
Even though SSDs have limited write endurance, there is massive
cost/performance benefit to using one in a swapcache configuration.
.Pp
Let's use the Intel X25V 40GB MLC SATA SSD as an example.
This device has approximately a
40TB (40 terabyte) write endurance, but see the later
notes on this; it is more of a minimum value.
Limiting the long term average bandwidth to 100KB/sec leads to no more
than ~9GB/day of writing, which works out to roughly a 12 year endurance.
Endurance scales linearly with size.
The 80GB version of this SSD
will have a write endurance of approximately 80TB.
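.Pp
The arithmetic behind that estimate is straightforward:
.Pp
.Dl 100KB/sec * 86400 sec/day =~ 8.6GB/day written
.Dl 40TB / 8.6GB/day =~ 4600 days =~ 12.7 years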
.Pp
MLC SSDs have approximately a 1000-10000x write endurance, while the
lower density, higher-cost SLC SSDs have approximately a 10000-100000x
write endurance.
MLC SSDs can be used for the swapcache (and swap) as long as the system
manager is cognizant of their limitations.
.Pp
.It Va vm.swapcache.meta_enable
Turning on just
.Va meta_enable
causes only filesystem meta-data to be cached and will result
in very fast directory operations even over millions of inodes
and even in the face of other invasive operations being run
by other processes.
.Pp
For
.Nm HAMMER
filesystems, meta-data includes the B-Tree, directory entries,
and data related to tiny files.
Approximately 6 GB of swapcache is needed
for every 14 million or so inodes cached, effectively giving one the
ability to cache all the meta-data in a multi-terabyte filesystem using
a fairly small SSD.
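.Pp
As a rough rule of thumb derived from those numbers:
.Pp
.Dl 6GB / 14,000,000 inodes =~ 430 bytes of swapcache per inode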
.Pp
.It Va vm.swapcache.data_enable
Turning on
.Va data_enable
(with or without other features) allows bulk file data to be cached.
This feature is very useful for web server operation when the
operational data set fits in swap.
The usefulness is somewhat limited by the maximum number
of vnodes supported by the system via
.Va kern.maxvnodes ,
because the bulk data in the cache is lost when the related
vnode is recycled.
In this case it might be desirable to
take the plunge into running a 64-bit kernel which can support
far more vnodes.
32-bit kernels have limited kernel virtual
memory (KVM) and cannot reliably support more than around
100,000 active vnodes.
64-bit kernels can support 300,000+ active vnodes.
.Pp
Data caching is definitely more wasteful of the SSD's write durability
than meta-data caching.
The swapcache may exhaust its burst and smack against the long term
average bandwidth limit, causing the SSD to wear out at the maximum rate
you programmed.
Data caching is far less wasteful and more efficient
if (on a 64-bit system only) you provide a sufficiently large SSD and
increase
.Va kern.maxvnodes
to cover the entire directory topology being served.
Each vnode requires about 1KB of physical RAM.
.Pp
Due to the higher SSD write rate you may want to use a
medium-sized SSD with good write performance to reduce interference
between reading and writing.
Write durability also scales with larger SSDs.
For example, an Intel X25-V only has 40MB/s of write performance
and burst writing by swapcache will seriously interfere with
concurrent read operations on the SSD.
The 80GB X25-M on the other hand has double the write performance.
.Pp
When data caching is turned on you generally want to use
.Xr chflags 1
with the
.Va cache
flag to enable data caching on a directory.
This flag is tracked by the namecache and does not need to be
recursively set in the directory tree.
Simply setting the flag in a top level directory or mount point
is usually sufficient.
However, the flag does not track across mount points.
A typical setup is something like this:
.Pp
.Dl chflags cache /etc /sbin /bin /usr /home
.Dl chflags noscache /usr/obj
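.Pp
The flags currently set on a file or directory can be verified with the
.Fl o
option to
.Xr ls 1 ,
for example:
.Pp
.Dl ls -lo /usr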
.Pp
If that doesn't work you can turn off
.Va vm.swapcache.use_chflags
entirely and not bother with any
.Xr chflags 1 Ns 'ing .
.Pp
Filesystems such as NFS which do not support flags generally
have a
.Va cache
mount option which enables swapcache operation on the mount.
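.Pp
A hypothetical
.Xr fstab 5
entry using such a mount option might look like this (the server name
and paths are placeholders):
.Pp
.Dl fileserver:/export /mnt nfs rw,cache 0 0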
.Pp
.It Va vm.swapcache.maxfilesize
This may be used to reduce cache thrashing when the focus is on a small,
potentially fragmented filespace, leaving the
larger files alone.
.Pp
.It Va vm.swapcache.minburst
This controls hysteresis and prevents nickel-and-dime write bursting.
Once
.Va curburst
drops to zero, writing to the swapcache ceases until it has recovered past
.Va minburst .
The idea here is to avoid creating a heavily fragmented swapcache where
reading data from a file must alternate between the cache and the primary
filesystem.
Doing so does not save disk seeks on the primary filesystem
so we want to avoid doing small bursts.
This parameter allows us to do larger bursts.
The larger bursts also tend to improve SSD performance as the SSD itself
can do a better job write-combining and erasing blocks.
.Pp
.It Va vm.swapcache.maxswappct
This controls the maximum amount of swap space
.Nm
may use, in percentage terms.
.El
.Pp
It is important to note that you should always use
.Xr disklabel64 8
to label your SSD.
Disklabel64 will properly align the base of the
partition space relative to the physical drive regardless of how badly
aligned the fdisk slice is.
This will significantly reduce write amplification and write-combining
inefficiencies on the SSD.
.Pp
Finally, interleaved swap (multiple SSDs) may be used to increase
performance even further.
A single SATA SSD is typically capable of reading 120-220MB/sec.
Configuring two SSDs for your swap will
improve aggregate swapcache read performance by 1.5x to 1.8x.
In tests with two Intel 40GB SSDs, 300MB/sec was easily achieved.
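.Pp
Swap is interleaved automatically across the configured swap devices,
for example via
.Xr fstab 5
entries such as these (device names are placeholders):
.Pp
.Dl /dev/da0s1b none swap sw 0 0
.Dl /dev/da1s1b none swap sw 0 0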
.Pp
At this point you will be configuring more swap space than a 32-bit
.Dx
kernel can handle (due to KVM limitations).
By default, 32-bit
.Dx
systems only support 32GB of configured swap and while this limit
can be increased somewhat in
.Pa /boot/loader.conf
you should really be using a 64-bit
.Dx
kernel instead.
64-bit systems support up to 512GB of swap by default
and can be boosted to up to 8TB if you are really crazy and have enough RAM.
Each 1GB of swap requires around 1MB of physical memory to manage it, so
the practical limit is more around 1TB of swap.
.Pp
Of course, a 1TB SSD is something on the order of $3000+ as of this writing.
Even though a 1TB configuration might not be cost effective, storage levels
more in the 100-200GB range certainly are.
If the machine has only a 1GigE
Ethernet interface (100MB/s) there's no point configuring it for more SSD
bandwidth.
A single SSD of the desired size would be sufficient.
.Sh INITIAL BURSTING & REPEATED BURSTING
Even though the average write bandwidth is limited it is desirable
to have a large initial burst after boot to load the cache.
.Va curburst
is initialized to 4GB by default and you can force rebursting
by adjusting it with a sysctl.
Remember that
.Va curburst
dynamically tracks the burst and will go up and down depending on write
activity and the accumulation rate.
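.Pp
For example, to force a fresh burst by resetting the parameter to its
default value:
.Pp
.Dl sysctl vm.swapcache.curburst=4000000000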
.Pp
In addition there will be periods of time where the system is in
steady state and not writing to the swapcache.
During these periods
.Va curburst
will inch back up but will not exceed
.Va maxburst .
Thus the
.Va maxburst
value controls how large a repeated burst can be.
.Pp
A second bursting parameter called
.Va vm.swapcache.minburst
controls bursting when the maximum write bandwidth has been reached.
When
.Va curburst
reaches zero write activity ceases and
.Va curburst
is allowed to recover up to
.Va minburst
before write activity resumes.
The recommended range for the
.Va minburst
parameter is 1MB to 50MB.
This parameter has a relationship to
how fragmented the swapcache gets when not in a steady state.
Large bursts reduce fragmentation and reduce incidences of
excessive seeking on the hard drive.
If set too low the
swapcache will become fragmented within a single regular file
and the constant back-and-forth between the swapcache and the
hard drive will result in excessive seeking on the hard drive.
.Sh SWAPCACHE SIZE & MANAGEMENT
The swapcache feature will use up to 75% of configured swap space
by default.
The remaining 25% is reserved for normal paging operation.
The system operator should configure at least 4 times as much swap space
as main memory and no less than 8GB of swap space.
If a 40GB SSD is used the recommendation is to configure 16GB to 32GB of
swap (note: 32-bit kernels are limited to 32GB of swap by default; 64-bit
kernels to 512GB), and to leave the remainder unwritten and unused.
.Pp
The
.Va vm.swapcache.maxswappct
sysctl may be used to change the default.
You may have to change this default if you also use
.Xr tmpfs 5 ,
.Xr vn 4 ,
or if you have not allocated enough swap for reasonable normal paging
activity to occur (in which case you probably shouldn't be using
.Nm
anyway).
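.Pp
For example, to lower the limit to 50% of configured swap
(an illustrative value, not a recommendation):
.Pp
.Dl sysctl vm.swapcache.maxswappct=50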
.Pp
If swapcache reaches the 75% limit it will begin tearing down swap
in linear bursts by iterating through available VM objects, until
swap space use drops to 70%.
The tear-down is limited by the rate at
which new data is written and this rate in turn is often limited by
.Va vm.swapcache.accrate ,
resulting in an orderly replacement of cached data and meta-data.
The limit is typically only reached when doing full data+meta-data
caching with no file size limitations and serving primarily large
files, or (on a 64-bit system) bumping
.Va kern.maxvnodes
up to very high values.
.Sh NORMAL SWAP PAGING ACTIVITY WITH SSD SWAP
This is not a function of
.Nm
per se but instead a normal function of the system.
Most systems have
sufficient memory that they do not need to page memory to swap.
These types of systems are the ones best suited for MLC SSD
configured swap running with a
.Nm
configuration.
Systems which modestly page to swap, in the range of a few hundred
megabytes a day worth of writing, are also well suited for MLC SSD
configured swap.
Desktops usually fall into this category even if they
page out a bit more because swap activity is governed by the actions of
a single person.
.Pp
Systems which page anonymous memory heavily when
.Nm
would otherwise be turned off are not usually well suited for MLC SSD
configured swap.
Heavy paging activity is not governed by
.Nm
bandwidth control parameters and can lead to excessive uncontrolled
writing to the MLC SSD, causing premature wearout.
You would have to use the lower density, more expensive SLC SSD
technology (which has 10x the durability).
This isn't to say that
.Nm
would be ineffective, just that the aggregate write bandwidth required
to support the system would be too large for MLC flash technologies.
.Pp
With this caveat in mind, SSD based paging on systems with insufficient
RAM can be extremely effective in extending the useful life of the system.
For example, a system with a measly 192MB of RAM and SSD swap can run
a -j 8 parallel build world in a little less than twice the time it
would take if the system had 2GB of RAM, whereas it would take 5x to 10x
as long with normal HD based swap.
.Sh USING SWAPCACHE WITH NORMAL HARD DRIVES
Although
.Nm
is designed to work with SSD-based storage it can also be used with
HD-based storage as an aid for offloading the primary storage system.
Here we need to make a distinction between using RAID for fanning out
storage versus using RAID for redundancy.
There are numerous situations
where RAID-based redundancy does not make sense.
.Pp
A good example would be in an environment where the servers themselves
are redundant and can suffer a total failure without affecting
ongoing operations.
When the primary storage requirements easily fit onto
a single large-capacity drive it doesn't make a whole lot of sense to
use RAID if your only desire is to improve performance.
If you had a farm
of, say, 20 servers supporting the same facility, adding RAID to each one
would not accomplish anything other than to bloat your deployment and
maintenance costs.
.Pp
In these sorts of situations it may be desirable and convenient to have
the primary filesystem for each machine on a single large drive and then
use the
.Nm
facility to offload the drive and make the machine more effective without
actually distributing the filesystem itself across multiple drives.
For the purposes of offloading, while an SSD would be the most effective
from a performance standpoint, a second medium-sized HD with its much lower
cost and higher capacity might actually be more cost effective.
.Pp
In cases where you might desire to use
.Nm
with a normal hard drive you should probably consider running a 64-bit
.Dx
instead of a 32-bit system.
The 64-bit build is capable of supporting much larger swap configurations
(upwards of 512GB) and would be a more suitable match against a medium-sized
HD.
.Sh EXPLANATION OF STATIC VS DYNAMIC WEAR LEVELING, AND WRITE-COMBINING
Modern SSDs keep track of space that has never been written to.
This would also include space freed up via TRIM, but simply not
touching a bit of storage in a factory fresh SSD works just as well.
Once you touch (write to) the storage all bets are off, even if
you reformat/repartition later.
It takes sending the SSD a
whole-device TRIM command or special format command to take it back
to its factory-fresh condition (sans wear already present).
.Pp
SSDs have wear leveling algorithms which are responsible for trying
to even out the erase/write cycles across all flash cells in the
storage.
The better a job the SSD can do, the longer the SSD will
remain usable.
.Pp
The more unused storage there is from the SSD's point of view, the
easier a time the SSD has running its wear leveling algorithms.
Basically the wear leveling algorithm in a modern SSD (say Intel or OCZ)
uses a combination of static and dynamic leveling.
Static is the
best, allowing the SSD to reuse flash cells that have not been
erased very much by moving static (unchanging) data out of them and
into other cells that have more wear.
Dynamic wear leveling involves
writing data to available flash cells and then marking the cells containing
the previous copy of the data as being free/reusable.
Dynamic wear leveling
is the worst kind but the easiest to implement.
Modern SSDs use a combination
of both algorithms and also do write-combining.
.Pp
USB sticks often use only dynamic wear leveling and have short life spans
because of that.
.Pp
In any case, any unused space in the SSD effectively makes the dynamic
wear leveling the SSD does more efficient by giving the SSD more 'unused'
space above and beyond the physical space it reserves beyond its stated
storage capacity to cycle data through, so the SSD lasts longer in theory.
.Pp
Write-combining is a feature whereby the SSD is able to reduce write
amplification effects by combining OS writes of smaller, discrete,
non-contiguous logical sectors into a single contiguous 128KB physical
flash block.
.Pp
On the flip side write-combining also results in more complex lookup tables
which can become fragmented over time and reduce the SSD's read performance.
Fragmentation can also occur when write-combined blocks are rewritten
piecemeal.
Modern SSDs can regain the lost performance by de-combining previously
write-combined areas as part of their static wear leveling algorithm, but
at the cost of extra write/erase cycles which slightly increase write
amplification effects.
Operating systems can also help maintain the SSD's performance by utilizing
larger blocks.
Write-combining results in a net reduction
of write-amplification effects but due to having to de-combine later and
other fragmentation effects it isn't 100%.
From testing with Intel devices write-amplification can be well controlled
in the 2x-4x range with the OS doing 16K writes, versus a worst-case
8x write-amplification with 16K blocks, 32x with 4K blocks, and a truly
horrid worst-case with 512 byte blocks.
.Pp
The
.Dx
.Nm
feature utilizes 64K-128K writes and is specifically designed to minimize
write amplification and write-combining stresses.
In terms of placing an actual filesystem on the SSD, the
.Dx
.Xr hammer 8
filesystem utilizes 16K blocks and is well behaved as long as you limit
reblocking operations.
For UFS you should create the filesystem with at least a 4K fragment
size, versus the default 2K.
Modern Windows filesystems use 4K clusters but it is unclear how SSD-friendly
NTFS is.
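.Pp
A minimal sketch of creating such a UFS filesystem with a 16K block size
and a 4K fragment size (the device name is a placeholder):
.Pp
.Dl newfs -b 16384 -f 4096 /dev/da1s1d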
.Sh WARNINGS
I am going to repeat and expand a bit on SSD wear.
Wear on SSDs is a function of the write durability of the cells,
whether the SSD implements static or dynamic wear leveling (or both),
write amplification effects when the OS does not issue write-aligned 128KB
ops or when the SSD is unable to write-combine adjacent logical sectors,
or if the SSD has a poor write-combining algorithm for non-adjacent sectors.
In addition, some extra erase/rewrite activity occurs from cleanup
operations the SSD performs as part of its static wear leveling algorithms
and its write-decombining algorithms (necessary to maintain performance over
time).
MLC flash uses 128KB physical write/erase blocks while SLC flash
typically uses 64KB physical write/erase blocks.
.Pp
The algorithms the SSD implements in its firmware are probably the most
important part of the device and a major differentiator between e.g. SATA
and USB-based SSDs.
SATA form factor drives will universally be far superior
to USB storage sticks.
SSDs can also have wildly different wearout rates and wildly different
performance curves over time.
For example, the performance of an SSD which does not implement
write-decombining can seriously degrade over time as its lookup
tables become severely fragmented.
For the purposes of this manual page we are primarily using Intel and OCZ
drives when describing performance and wear issues.
.Pp
.Nm
parameters should be carefully chosen to avoid early wearout.
For example, the Intel X25V 40GB SSD has a minimum write durability
of 40TB and an actual durability that can be quite a bit higher.
Generally speaking, you want to select parameters that will give you
at least 10 years of service life.
The most important parameter to control this is
.Va vm.swapcache.accrate .
.Nm
uses a very conservative 100KB/sec default but even a small X25V
can probably handle 300KB/sec of continuous writing and still last 10 years.
.Pp
Depending on the wear leveling algorithm the drive uses, durability
and performance can sometimes be improved by configuring less
space (in a manufacturer-fresh drive) than the drive's probed capacity.
For example, by only using 32GB of a 40GB SSD.
SSDs typically implement 10% more storage than advertised and
use this storage to improve wear leveling.
As cells begin to fail
this overallotment slowly becomes part of the primary storage
until it has been exhausted.
After that the SSD has basically failed.
Keep in mind that if you use a larger portion of the SSD's advertised
storage the SSD will not know if/when you decide to use less unless
appropriate TRIM commands are sent (if supported), or a low level
factory erase is issued.
.Pp
.Nm smartctl
(from pkgsrc's sysutils/smartmontools) may be used to retrieve
the wear indicator from the drive.
One usually runs something like
.Ql smartctl -d sat -a /dev/daXX
(for AHCI/SILI/SCSI), or
.Ql smartctl -a /dev/adXX
for NATA.
Some SSDs
(particularly the Intels) will brick the SATA port when SMART operations
are done while the drive is busy with normal activity, so the tool should
only be run when the SSD is idle.
.Pp
ID 232 (0xe8) in the SMART data dump indicates available reserved
space and ID 233 (0xe9) is the wear-out meter.
Reserved space
typically starts at 100 and decrements to 10, after which the SSD
is considered to operate in a degraded mode.
The wear-out meter typically starts at 99 and decrements to 0,
after which the SSD has failed.
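.Pp
A quick way to watch just those two attributes on an idle, AHCI-attached
drive (the device name is a placeholder):
.Pp
.Dl smartctl -d sat -A /dev/da1 | awk '$1 == 232 || $1 == 233'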
.Pp
.Nm
tends to use large 64KB writes and tends to cluster multiple writes
linearly.
The SSD is able to take significant advantage of this
and write amplification effects are greatly reduced.
If we take a 40GB Intel X25V as an example the vendor specifies a write
durability of approximately 40TB, but
.Nm
should be able to squeeze out upwards of 200TB due to the fairly optimal
write clustering it does.
The theoretical limit for the Intel X25V is 400TB (10,000 erase cycles
per MLC cell, 40GB drive), but the firmware doesn't do perfect static
wear leveling so the actual durability is less.
In tests over several hundred days we have validated a write endurance
greater than 200TB on the 40GB Intel X25V using
.Nm .
.Pp
In contrast, filesystems directly stored on an SSD could have
fairly severe write amplification effects and will have durabilities
ranging closer to the vendor-specified limit.
.Pp
Power-on hours, power cycles, and read operations do not really affect wear.
There is something called read-disturb but it is unclear what sort of
ratio would be needed.
Since the data is cached in RAM and thus not
re-read at a high rate there is no expectation of a practical effect.
For all intents and purposes only write operations affect wear.
.Pp
SSDs with MLC-based flash technology are high-density, low-cost solutions
with limited write durability.
SLC-based flash technology is a low-density,
higher-cost solution with 10x the write durability of MLC.
The durability also scales with the amount of flash storage.
SLC-based flash is typically
twice as expensive per gigabyte.
From a cost perspective, SLC-based flash
is at least 5x more cost effective in situations where high write
bandwidths are required (because it lasts 10x longer).
MLC is at least 2x more cost effective in situations where high
write bandwidth is not required.
When wear calculations are in years, these differences become huge, but
often the quantity of storage needed trumps the wear life so we expect most
people will be using MLC.
.Nm
is usable with both technologies.
.Sh SEE ALSO
.Xr chflags 1 ,
.Xr fstab 5 ,
.Xr disklabel64 8 ,
.Xr hammer 8 ,
.Xr swapon 8
.Sh HISTORY
.Nm
first appeared in
.Dx 2.5 .
.Sh AUTHORS
.An Matthew Dillon