# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests,
which are later published in the [spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment, please follow the steps below.

## Test Systems Requirements

- The OS installed on test systems must be a Linux OS.
  The scripts have primarily been used on systems running Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like Intel E810 CQDA2 or Mellanox ConnectX-5 are
  suggested because the NVMe-oF workload is network bound.
  If you use a NIC capable of less than 100Gbps on the NVMe-oF target
  system, you will quickly saturate it.
- A Python3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed.
- The nvmecli package must be installed on all test systems.
- fio must be downloaded from [Github](https://github.com/axboe/fio) and built.
  This must be done on Initiator test systems so that SPDK can later be built with
  the "--with-fio" option (see the sketch after this list).
- All test systems must have a user account with a common name,
  password and passwordless sudo enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.
- The `sysstat` package must be installed for SAR CPU utilization measurements.
- The `bwm-ng` package must be installed for NIC bandwidth utilization measurements.
- The `pcm` package must be installed for pcm CPU measurements.
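
A minimal sketch of preparing these prerequisites on an Initiator system, assuming
the /usr/src/fio location used as the default `fio_bin` path later in this document
(adjust paths to your environment):

``` ~sh
# Python modules used by the automation scripts
sudo pip3 install paramiko pandas

# Build fio from source; the sources are needed later to build the SPDK fio_plugin
cd /usr/src
sudo git clone https://github.com/axboe/fio
cd fio
sudo ./configure && sudo make

# Build SPDK with the fio plugin enabled (from the SPDK repository root)
cd /path/to/spdk
./configure --with-fio=/usr/src/fio
make
```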

### Optional

- For tests using the Kernel Target, nvmetcli must be downloaded and built on the Target system
  (see the sketch after this list).
  nvmetcli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).
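
For example, one possible way to fetch and install nvmetcli (the install location is an example):

``` ~sh
# Clone and install nvmetcli on the Target system
cd /usr/src
sudo git clone http://git.infradead.org/users/hch/nvmetcli.git
cd nvmetcli
sudo python3 setup.py install
```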

## Manual configuration

Before running the scripts some manual test system configuration is required:

- Configure IP address assignment on the NIC ports that will be used for testing.
  Make sure these assignments are persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible). See the sketch after this list.
- Adjust or completely disable local security engines like AppArmor or SELinux.
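
A minimal sketch of these steps, assuming a Fedora system with NetworkManager,
firewalld and SELinux; the interface name, addresses and port are examples only
(adjust to your setup):

``` ~sh
# Persistent IP address on the NIC port used for the test
sudo nmcli connection modify ens801f0 ipv4.addresses 192.0.1.1/24 ipv4.method manual
sudo nmcli connection up ens801f0

# Allow NVMe-oF traffic (4420 is the default NVMe-oF port) or disable the firewall
sudo firewall-cmd --add-port=4420/tcp --permanent && sudo firewall-cmd --reload
# sudo systemctl disable --now firewalld

# Put SELinux into permissive mode for the duration of the tests
sudo setenforce 0
```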

## JSON configuration for test run automation

An example JSON configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-chapters describe each configuration section in more detail.

### General settings section

``` ~sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool,
    "irdma_roce_enable": bool,
    "pause_frames": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default SPDK sources will be copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run. If SPDK
  is already in place on the Initiator systems and there's no need to re-build it,
  then set this option to true.
  Default: false.
- irdma_roce_enable - loads the irdma driver with the RoCEv2 network protocol enabled on Target and
  Initiator machines. This option applies only to systems with Intel E810 NICs.
  Default: false.
- pause_frames - configures pause frames when the RoCEv2 network protocol is enabled on Target and
  Initiator machines.
  Default: false.

### Target System Configuration

``` ~sh
"target": {
  "mode": "spdk",
  "nic_ips": ["192.0.1.1", "192.0.2.1"],
  "core_mask": "[1-10]",
  "null_block_devices": 8,
  "nvmet_bin": "/path/to/nvmetcli",
  "sar_settings": true,
  "pcm_settings": false,
  "enable_bandwidth": [true, 60],
  "enable_dpdk_memory": true,
  "num_shared_buffers": 4096,
  "scheduler_settings": "static",
  "zcopy_settings": false,
  "dif_insert_strip": true,
  "null_block_dif_type": 3,
  "pm_settings": [true, 30, 1, 60],
  "irq_settings": {
    "mode": "cpulist",
    "cpulist": "[0-10]",
    "exclude_cpulist": false
  }
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by SPDK target only.
  CPU core mask either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of a range is inclusive!
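
For example, the following two forms select the same cores: the core list used in the
snippet above, and the equivalent hexadecimal mask with bits 1 through 10 set:

``` ~sh
"core_mask": "[1-10]"   # core list: cores 1 through 10, inclusive
"core_mask": "0x7FE"    # equivalent hex mask (bits 1-10 set)
```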

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - bool.
  Enable SAR CPU utilization measurement on the Target side. The SAR thread will
  wait until fio finishes its "ramp_time" and then start measurement for
  the fio "run_time" duration. Default: enabled.
- pcm_settings - bool.
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include only CPU consumption. Default: enabled.
- enable_bandwidth - bool. Measure bandwidth utilization on network
  interfaces. Default: enabled.
- tuned_profile - tunedadm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- enable_pm - bool;
  if set to true, power measurement is enabled via collect-bmc-pm on
  the Target side. Default: true.
- irq_settings - dict;
  Choose how to adjust network interface IRQ settings (see the example after this list).
  mode: default - run the IRQ alignment script with no additional options.
  mode: bynode - align IRQs to be processed only on CPU cores matching the NIC
    NUMA node.
  mode: cpulist - align IRQs to be processed only on the CPU cores provided
    in the cpulist parameter.
  cpulist: list of CPU cores to use for cpulist mode. Can be provided as a
    list of individual cores ("[0,1,10]"), core ranges ("[0-10]"), or a mix
    of both ("[0-1,10,20-22]").
  exclude_cpulist: reverse the effect of cpulist mode. Allow IRQ processing
    only on CPU cores which are not provided in the cpulist parameter.
- sock_impl - str. Specifies the socket implementation to be used. This can be 'posix' for
  the POSIX socket interfaces, or 'uring' for the Linux io_uring interface.
  Default: posix
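
For example, the following `irq_settings` snippet allows IRQ processing only on
cores *not* in the given list (values are illustrative):

``` ~sh
"irq_settings": {
  "mode": "cpulist",
  "cpulist": "[0-3]",
  "exclude_cpulist": true
}
```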

Optional, Kernel Target only:

- nvmet_bin - path to nvmetcli binary, if not available in $PATH.
  Only for Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only:

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- max_queue_depth - int, max number of outstanding I/O per queue. Default: 128.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- num_cqe - int, number of completion queue entries. See doc/json_rpc.md
  "nvmf_create_transport" section. Default: 4096.
- null_block_dif_type - int, 0-3. Level of DIF type to use when creating
  null block bdevs. Default: 0.
- enable_dpdk_memory - bool. Wait for fio's ramp_time to finish and
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats.
  Default: enabled.
- adq_enable - bool; only for TCP transport.
  Configure system modules, NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
- bpf_scripts - list of bpftrace scripts that will be attached during the
  test run. Available scripts can be found in the spdk/scripts/bpf directory.
- dsa_settings - bool. Only for TCP transport. Enable offloading CRC32C
  calculation to DSA. You need a CPU with the Intel(R) Data Streaming
  Accelerator (DSA) engine.
- scheduler_core_limit - int, 0-100. Dynamic scheduler option to set the load limit
  at which a core is considered full.
- irq_settings - dict;
  Choose how to adjust network interface IRQ settings (see the example after this list).
  Same as in the common options section, but the SPDK Target allows more modes:
  mode: shared - align IRQs to be processed only on the same CPU cores which
    are already used by the SPDK Target process.
  mode: split - align IRQs to be processed only on CPU cores which are not
    used by the SPDK Target process.
  mode: split-bynode - same as "split", but reduce the CPU cores used for
    IRQ processing to only those matching the NIC NUMA node.
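
As an illustration, an SPDK Target section combining several of these options might
look as follows (all values, including the bpftrace script name, are examples only):

``` ~sh
"target": {
  "mode": "spdk",
  "nic_ips": ["192.0.1.1"],
  "core_mask": "[1-10]",
  "max_queue_depth": 128,
  "num_cqe": 4096,
  "bpf_scripts": ["nvmf.bt"],
  "irq_settings": { "mode": "split-bynode" }
}
```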

### Initiator system settings section

There can be one or more `initiatorX` setting sections, depending on the test setup.

``` ~sh
"initiator1": {
  "ip": "10.0.0.1",
  "nic_ips": ["192.0.1.2"],
  "target_nic_ips": ["192.0.1.1"],
  "mode": "spdk",
  "fio_bin": "/path/to/fio/bin",
  "nvmecli_bin": "/path/to/nvmecli/bin",
  "cpus_allowed": "0,1,10-15",
  "cpus_allowed_policy": "shared",
  "num_cores": 4,
  "cpu_frequency": 2100000,
  "adq_enable": false,
  "kernel_engine": "io_uring",
  "irq_settings": { "mode": "bynode" }
}
```

Required:

- ip - management IP address of the initiator system used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For "spdk", the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvmecli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os.

Optional, common:

- nvmecli_bin - path to the nvmecli binary; will be used for the "discovery" command
  (for both SPDK and Kernel modes) and for "connect" (in case of Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run IO.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores randomly allocated).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - By default fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at max frequencies. This option allows
  the user to select a CPU frequency instead of running at max frequency. Before
  using this option `intel_pstate=disable` must be set in the boot options and
  the cpupower governor must be set to `userspace`.
- tuned_profile - tunedadm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- kernel_engine - Select the fio ioengine mode to run tests. io_uring libraries and
  io_uring capable fio binaries must be present on Initiator systems!
  Available options:
  - libaio (default)
  - io_uring
- irq_settings - dict;
  Same as "irq_settings" in the Target common options section.

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules, NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
- enable_data_digest - bool; only for TCP transport. Enable the data
  digest for the bdev controller. The target can use IDXD to calculate the
  data digest or fall back to a software optimized implementation on systems
  that don't have the Intel(R) Data Streaming Accelerator (DSA) engine.
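
As an additional illustration, a minimal second initiator section using the Kernel
mode with the io_uring engine could look as follows (all addresses and values are examples):

``` ~sh
"initiator2": {
  "ip": "10.0.0.2",
  "nic_ips": ["192.0.2.2"],
  "target_nic_ips": ["192.0.2.1"],
  "mode": "kernel",
  "kernel_engine": "io_uring"
}
```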

### Fio settings section

``` ~sh
"fio": {
  "bs": ["4k", "128k"],
  "qd": [32, 128],
  "rw": ["randwrite", "write"],
  "rwmixread": 100,
  "rate_iops": 10000,
  "num_jobs": 2,
  "offset": true,
  "offset_inc": 10,
  "run_time": 30,
  "ramp_time": 30,
  "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - read operations percentage in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during this period
- run_num - number of times each workload combination is run.
  If more than 1, then the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number
- offset - bool; enable offsetting of the IO within the file. When this option is
  enabled the file is "split" into a number of chunks equal to the "num_jobs"
  parameter value, and each of the "num_jobs" fio threads gets its own chunk to
  work with.
  For more detail see "offset", "offset_increment" and "size" in the fio man
  pages. Default: false.
- offset_inc - int; percentage value determining the offset, size and
  offset_increment when the "offset" option is enabled. By default, if "offset"
  is enabled, the fio file will get split evenly between the fio threads doing the
  IO. offset_inc can be used to specify a custom value.

#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In such a case the script creates a list of their combinations and runs IO tests
for all of these combinations. For example, the following configuration:

``` ~sh
  "bs": ["4k"],
  "qd": [32, 128],
  "rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio target
device ("filename" in the resulting fio configuration file). It is re-calculated
while the script is running, so the generated fio configuration file might contain
a different value than what the user has specified at input, especially when also
using the "numjobs" or initiator "num_cores" parameters. For example:

The Target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.

Initiator configuration (relevant settings only):

``` ~sh
"initiator1": {
  "num_cores": 1
}
```

Fio configuration:

``` ~sh
"fio": {
  "bs": ["4k"],
  "qd": [128],
  "rw": ["randread"],
  "rwmixread": 100,
  "num_jobs": 1,
  "run_time": 30,
  "ramp_time": 30,
  "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

``` ~sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between each `filename` local to the job section in a round-robin
(by default) fashion. In the case of fio targets with the same characteristics
(IOPS & bandwidth capabilities) this means that the iodepth is distributed **roughly**
equally. Ultimately, the above fio configuration results in iodepth=128 per filename.
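
In other words, based on the examples in this chapter, the per-section iodepth appears
to be derived roughly as follows:

``` ~sh
# iodepth written to a job section = qd * filenames_in_section / numjobs
# numjobs=1: 128 * 4 / 1 = 512
# numjobs=2: 128 * 4 / 2 = 256  (see the example below)
```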

A `numjobs` value higher than 1 is also taken into account, so that the desired qd per
filename is retained:

``` ~sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```

With the exception of `run_num`, more information on these options can be found in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.
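
For example (run from the SPDK repository root; the HUGEMEM value is just an
illustrative assumption for larger configurations):

``` ~sh
cd spdk
# Optionally pre-allocate more hugepage memory (in MB)
sudo HUGEMEM=8192 scripts/setup.sh
```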

Run the script on the NVMe-oF target system:

``` ~sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file in the scripts/perf/nvmf
directory. You can specify a different configuration file at runtime as shown below:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local Python
modules. If you'd like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/python`,
you need to modify your environment so that the Python interpreter is aware of
the `spdk/python` directory.
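
One possible way to do that, assuming SPDK is cloned under your home directory
(the path is an example), is to export PYTHONPATH from your shell profile:

``` ~sh
# e.g. append to ~/.bashrc; adjust the path to your SPDK checkout
export PYTHONPATH=$PYTHONPATH:$HOME/spdk/python
```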

## Test Results

Test results for all workload combinations are printed to the screen once the tests
are finished. Additionally, all aggregate results are saved to /tmp/results/nvmf_results.conf.
The results directory path can be changed with the -r script parameter.
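
For example, to store the results in a custom directory (the path is illustrative):

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -r /path/to/results/dir
```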
454