# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests,
which are later published in the [spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment, please follow the steps below.

## Test Systems Requirements

- The OS installed on the test systems must be a Linux OS.
  The scripts were primarily used on systems with Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like the Intel E810 CQDA2 or Mellanox ConnectX-5 are
  suggested because the NVMe-oF workload is network bound.
  If you use a NIC capable of less than 100Gbps on the NVMe-oF target
  system, you will quickly saturate your NICs.
- A Python3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed (see the example after this list).
- The nvme-cli package must be installed on all test systems.
- fio must be downloaded from [GitHub](https://github.com/axboe/fio) and built.
  This must be done on the Initiator test systems to later build SPDK with
  the "--with-fio" option.
- All test systems must have a user account with a common name,
  password and passwordless sudo enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.
- The `sysstat` package must be installed for SAR CPU utilization measurements.
- The `bwm-ng` package must be installed for NIC bandwidth utilization measurements.
- The `pcm` package must be installed for pcm, pcm-power and pcm-memory measurements.
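
For example, on a Fedora system the prerequisites could be installed roughly as
follows (a sketch only; package names, install paths and the pip-based module
installation are assumptions that may need adjusting for your distribution):

``` ~sh
# System packages used by the scripts
sudo dnf install -y nvme-cli sysstat bwm-ng pcm
# Python modules required by the scripts
sudo pip3 install paramiko pandas
# fio sources, built on Initiator systems and later reused for SPDK's --with-fio
sudo git clone https://github.com/axboe/fio /usr/src/fio
cd /usr/src/fio && sudo ./configure && sudo make -j
# mlnx-tools, used to configure NIC port IRQ affinity
sudo git clone https://github.com/Mellanox/mlnx-tools /usr/src/local/mlnx-tools
```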

### Optional

- For tests using the Kernel Target, nvmet-cli must be downloaded and built on the Target system.
  nvmet-cli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).

## Manual configuration

Before running the scripts, some manual configuration of the test systems is required:

- Configure IP address assignment on the NIC ports that will be used for the test.
  Make sure to make these assignments persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible).
- Adjust or completely disable local security engines like AppArmor or SELinux.
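
A hedged sketch of what these steps could look like on a Fedora system that uses
NetworkManager, firewalld and SELinux (interface names and addresses are placeholders):

``` ~sh
# Persistent IP address assignment on a NIC port used for the test
sudo nmcli connection add type ethernet ifname ens2f0 con-name nvmf-ens2f0 ip4 192.0.1.1/24
# Disable the firewall for the duration of the test (or open the required ports instead)
sudo systemctl disable --now firewalld
# Put SELinux into permissive mode (or adjust its policy instead)
sudo setenforce 0
```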

## JSON configuration for test run automation

An example JSON configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-sections describe each configuration section in more detail.

### General settings section

``` ~sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default SPDK sources will be copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run. If SPDK
  is already in place on the Initiator systems and there is no need to re-build it,
  then set this option to true.
  Default: false.

### Target System Configuration

``` ~sh
"target": {
  "mode": "spdk",
  "nic_ips": ["192.0.1.1", "192.0.2.1"],
  "core_mask": "[1-10]",
  "null_block_devices": 8,
  "nvmet_bin": "/path/to/nvmetcli",
  "sar_settings": true,
  "pcm_settings": false,
  "enable_bandwidth": [true, 60],
  "enable_dpdk_memory": true,
  "num_shared_buffers": 4096,
  "scheduler_settings": "static",
  "zcopy_settings": false,
  "dif_insert_strip": true,
  "null_block_dif_type": 3,
  "pm_settings": [true, 30, 1, 60]
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by the SPDK target only.
  CPU core mask, either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of the range is inclusive!

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - bool.
  Enable SAR CPU utilization measurement on the Target side. The SAR thread will
  wait until fio finishes its "ramp_time" and then start measurement for the
  fio "run_time" duration. Default: enabled.
- pcm_settings - bool.
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include CPU, memory and power consumption. Default: enabled.
- enable_bandwidth - bool. Measure bandwidth utilization on network
  interfaces. Default: enabled.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test (see the snippet after this list).
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- enable_pm - bool;
  if set to true, power measurement is enabled via collect-bmc-pm on
  the Target side. Default: true.
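
For illustration, these two options could appear in the target section like this
(the profile name is just an example):

``` ~sh
"target": {
  ...
  "tuned_profile": "throughput-performance",
  "irq_scripts_dir": "/usr/src/local/mlnx-tools/ofed_scripts"
}
```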

Optional, Kernel Target only:

- nvmet_bin - path to nvmetcli binary, if not available in $PATH.
  Only for Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only (see the example snippet after this list):

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- max_queue_depth - int, max number of outstanding I/O per queue. Default: 128.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- null_block_dif_type - int, 0-3. Level of DIF type to use when creating the
  null block bdev. Default: 0.
- enable_dpdk_memory - bool. Wait for fio ramp_time to finish and
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats.
  Default: enabled.
- adq_enable - bool; only for TCP transport.
  Configure system modules, NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
- bpf_scripts - list of bpftrace scripts that will be attached during the
  test run. Available scripts can be found in the spdk/scripts/bpf directory.
- dsa_settings - bool. Only for TCP transport. Enable offloading of CRC32C
  calculation to DSA. You need a CPU with the Intel(R) Data Streaming
  Accelerator (DSA) engine.
- scheduler_core_limit - int, 0-100. Dynamic scheduler option: the load limit at
  which a core is considered full.
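
An illustrative snippet showing how the options above that are not present in the
earlier example could be added to the target section (all values, including the
bpftrace script name, are examples only):

``` ~sh
"target": {
  ...
  "max_queue_depth": 128,
  "adq_enable": false,
  "bpf_scripts": ["nvmf.bt"],
  "dsa_settings": false,
  "scheduler_core_limit": 80
}
```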

### Initiator system settings section

There can be one or more `initiatorX` setting sections, depending on the test setup.

``` ~sh
"initiator1": {
  "ip": "10.0.0.1",
  "nic_ips": ["192.0.1.2"],
  "target_nic_ips": ["192.0.1.1"],
  "mode": "spdk",
  "fio_bin": "/path/to/fio/bin",
  "nvmecli_bin": "/path/to/nvmecli/bin",
  "cpus_allowed": "0,1,10-15",
  "cpus_allowed_policy": "shared",
  "num_cores": 4,
  "cpu_frequency": 2100000,
  "adq_enable": false,
  "kernel_engine": "io_uring"
}
```

Required:

- ip - management IP address of the initiator system, used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For SPDK, the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvmecli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os.

Optional, common:

- nvmecli_bin - path to the nvmecli binary; will be used for the "discovery" command
  (for both SPDK and Kernel modes) and for "connect" (in case of Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run I/O.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores randomly allocated).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - By default fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at max frequencies. This option allows the
  user to select a CPU frequency instead of running at max frequency. Before
  using this option `intel_pstate=disable` must be set in the boot options and the
  cpupower governor must be set to `userspace` (see the sketch after this list).
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- kernel_engine - Select the fio ioengine mode to run tests. io_uring libraries and
  an io_uring-capable fio binary must be present on Initiator systems!
  Available options:
  - libaio (default)
  - io_uring
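
A sketch of the host preparation that the `cpu_frequency` option relies on
(grubby is Fedora/RHEL specific; adjust for your distribution and bootloader):

``` ~sh
# Disable the intel_pstate driver via kernel boot options, then reboot
sudo grubby --update-kernel=ALL --args="intel_pstate=disable"
# Set the cpufreq governor to userspace so a fixed frequency can be selected
sudo cpupower frequency-set -g userspace
```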

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules, NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
- enable_data_digest - bool; only for TCP transport. Enable the data
  digest for the bdev controller. The target can use IDXD to calculate the
  data digest or fall back to a software-optimized implementation on systems
  that don't have the Intel(R) Data Streaming Accelerator (DSA) engine.
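
For illustration, enabling the data digest for an SPDK-mode initiator could look
like this (values are examples only):

``` ~sh
"initiator1": {
  ...
  "mode": "spdk",
  "enable_data_digest": true
}
```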

### Fio settings section

``` ~sh
"fio": {
  "bs": ["4k", "128k"],
  "qd": [32, 128],
  "rw": ["randwrite", "write"],
  "rwmixread": 100,
  "rate_iops": 10000,
  "num_jobs": 2,
  "offset": true,
  "offset_inc": 10,
  "run_time": 30,
  "ramp_time": 30,
  "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - read operations percentage in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during this period
- run_num - number of times each workload combination is run.
  If more than 1, the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number
- offset - bool; enable offsetting of the IO within the file. When this option is
  enabled the file is "split" into a number of chunks equal to the "num_jobs"
  parameter value, and each of the "num_jobs" fio threads gets its own chunk to
  work with.
  For more detail see "offset", "offset_increment" and "size" in the fio man
  pages. Default: false. See the sketch after this list.
- offset_inc - int; percentage value determining the offset, size and
  offset_increment when the "offset" option is enabled. By default, if "offset"
  is enabled the fio file will get split evenly between the fio threads doing the
  IO. offset_inc can be used to specify a custom value.
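
As an illustration, with `"num_jobs": 4`, `"offset": true` and the default even
split, the generated job options could look roughly like this (a sketch only; the
exact values are computed by the script, and a custom `offset_inc` would replace
the 25% figures):

``` ~sh
[job_section0]
numjobs=4
size=25%
offset=0%
offset_increment=25%
```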

#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In such a case the script creates a list of their combinations and runs IO tests
for all of these combinations. For example, the following configuration:

``` ~sh
  "bs": ["4k"],
  "qd": [32, 128],
  "rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio target
device ("filename" in the resulting fio configuration file). It is re-calculated
while the script is running, so the generated fio configuration file might contain
a different value than the one specified by the user, especially when also
using the "numjobs" or initiator "num_cores" parameters. For example:

The target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.

Initiator configuration (relevant settings only):

``` ~sh
"initiator1": {
  "num_cores": 1
}
```

Fio configuration:

``` ~sh
"fio": {
  "bs": ["4k"],
  "qd": [128],
  "rw": ["randread"],
  "rwmixread": 100,
  "num_jobs": 1,
  "run_time": 30,
  "ramp_time": 30,
  "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

``` ~sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between each `filename` local to the job section in round-robin
(by default) fashion. In the case of fio targets with the same characteristics
(IOPS & bandwidth capabilities) this means that iodepth is distributed **roughly**
equally. Ultimately the above fio configuration results in iodepth=128 per filename.

A `numjobs` value higher than 1 is also taken into account, so that the desired qd per
filename is retained:

``` ~sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```
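
The relation implied by the two examples above can be summarized as follows
(a simplification that assumes fio targets with similar performance characteristics):

``` ~sh
# Per-job iodepth written to the generated fio config, so that each
# filename effectively sees the requested qd:
#   iodepth = qd * filenames_per_job_section / numjobs
# e.g. 128 * 4 / 1 = 512 (first example), 128 * 4 / 2 = 256 (second example)
```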

With the exception of `run_num`, more information on these options can be found in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.
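
For example (the HUGEMEM value below is only an illustration of how much hugepage
memory might be reserved):

``` ~sh
cd spdk
sudo HUGEMEM=8192 scripts/setup.sh
```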

Run the script on the NVMe-oF target system:

``` ~sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file in the scripts/perf/nvmf
directory. You can specify a different configuration file at runtime as shown below:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local Python
modules. If you'd like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/python`
you need to modify your environment so that the Python interpreter is aware of
the `spdk/python` directory.
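
One way to do that is to extend PYTHONPATH permanently in your shell profile
(the SPDK checkout path below is an example):

``` ~sh
echo 'export PYTHONPATH="$PYTHONPATH:/home/user/spdk/python"' >> ~/.bashrc
```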

## Test Results

Test results for all workload combinations are printed to the screen once the tests
are finished. Additionally, all aggregate results are saved to /tmp/results/nvmf_results.conf.
The results directory path can be changed with the -r script parameter.
412