# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests,
the results of which are later published in the [spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment, please follow the steps below.

## Test Systems Requirements

- The OS installed on the test systems must be a Linux OS.
  The scripts have been used primarily on systems running Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like the Intel E810 CQDA2 or Mellanox ConnectX-5 are
  recommended because the NVMe-oF workload is network bound.
  A NIC capable of less than 100 Gbps on the NVMe-oF target system
  will quickly become saturated.
- A Python3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed.
- The nvme-cli package must be installed on all test systems.
- fio must be downloaded from [GitHub](https://github.com/axboe/fio) and built.
  This must be done on the Initiator test systems to later build SPDK with
  the "--with-fio" option.
- All test systems must have a user account with a common name and password,
  and passwordless sudo must be enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.
- The `sysstat` package must be installed for SAR CPU utilization measurements.
- The `bwm-ng` package must be installed for NIC bandwidth utilization measurements.
- The `pcm` package must be installed for PCM CPU measurements.

### Optional

- For tests using the Kernel Target, nvmetcli must be downloaded and built on the Target system.
  nvmetcli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).

## Manual configuration

Before running the scripts some manual configuration of the test systems is required:

- Configure IP address assignment on the NIC ports that will be used for the test.
  Make sure these assignments are persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible).
- Adjust or completely disable local security engines like AppArmor or SELinux.

## JSON configuration for test run automation

An example JSON configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-chapters describe each configuration section in more detail.

### General settings section

``` ~sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool,
    "irdma_roce_enable": bool,
    "pause_frames": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default SPDK sources will be copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run.
  If SPDK is already in place on the Initiator systems and there is no need
  to re-build it, set this option to true.
  Default: false.
- irdma_roce_enable - loads the irdma driver with the RoCEv2 network protocol enabled
  on Target and Initiator machines. This option applies only to systems with
  Intel E810 NICs.
  Default: false.
- pause_frames - configures pause frames when the RoCEv2 network protocol is enabled
  on Target and Initiator machines.
  Default: false.
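For illustration, a filled-in "general" section for an RDMA test on Intel E810 NICs
with RoCEv2 enabled might look like the following sketch (credentials and the choice
of optional flags are examples only):

``` ~sh
"general": {
    "username": "spdkuser",
    "password": "spdkpassword",
    "transport": "rdma",
    "skip_spdk_install": false,
    "irdma_roce_enable": true,
    "pause_frames": true
}
```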
### Target System Configuration

``` ~sh
"target": {
    "mode": "spdk",
    "nic_ips": ["192.0.1.1", "192.0.2.1"],
    "core_mask": "[1-10]",
    "null_block_devices": 8,
    "nvmet_bin": "/path/to/nvmetcli",
    "sar_settings": true,
    "pcm_settings": false,
    "enable_bandwidth": [true, 60],
    "enable_dpdk_memory": true,
    "num_shared_buffers": 4096,
    "scheduler_settings": "static",
    "zcopy_settings": false,
    "dif_insert_strip": true,
    "null_block_dif_type": 3,
    "pm_settings": [true, 30, 1, 60],
    "irq_settings": {
        "mode": "cpulist",
        "cpulist": "[0-10]",
        "exclude_cpulist": false
    }
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by the SPDK target only.
  CPU core mask either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of the range is inclusive!

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - bool.
  Enable SAR CPU utilization measurement on the Target side. The SAR thread will
  wait until fio finishes its "ramp_time" and then start measurement for the
  fio "run_time" duration. Default: enabled.
- pcm_settings - bool.
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include only CPU consumption. Default: enabled.
- enable_bandwidth - bool. Measure bandwidth utilization on network
  interfaces. Default: enabled.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- enable_pm - bool;
  if set to true, power measurement is enabled via collect-bmc-pm on
  the target side. Default: true.
- irq_settings - dict;
  Choose how to adjust network interface IRQ settings.
  mode: default - run the IRQ alignment script with no additional options.
  mode: bynode - align IRQs to be processed only on CPU cores matching the NIC
  NUMA node.
  mode: cpulist - align IRQs to be processed only on the CPU cores provided
  in the cpulist parameter.
  cpulist: list of CPU cores to use for cpulist mode. Can be provided as a
  list of individual cores ("[0,1,10]"), core ranges ("[0-10]"), or a mix
  of both ("[0-1,10,20-22]").
  exclude_cpulist: reverse the effect of cpulist mode. Allow IRQ processing
  only on CPU cores which are not provided in the cpulist parameter
  (see the sketch after this list).
- sock_impl - str. Specifies the socket implementation to be used. This can be 'posix' for
  the POSIX socket interfaces, or 'uring' for the Linux io_uring interface.
  Default: posix
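As an illustration of the exclude_cpulist behaviour described above, the following
sketch (the core range is chosen arbitrarily) would allow IRQ processing on every
CPU core except 0-10, for example to keep IRQs away from cores reserved for the
target application:

``` ~sh
"irq_settings": {
    "mode": "cpulist",
    "cpulist": "[0-10]",
    "exclude_cpulist": true
}
```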
Optional, Kernel Target only:

- nvmet_bin - path to the nvmetcli binary, if not available in $PATH.
  Only for the Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only:

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- max_queue_depth - int, max number of outstanding I/O per queue. Default: 128.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- num_cqe - int, number of completion queue entries. See the
  "nvmf_create_transport" section in doc/jsonrpc.md. Default: 4096.
- null_block_dif_type - int, 0-3. Level of DIF type to use when creating
  the null block bdevs. Default: 0.
- enable_dpdk_memory - bool. Wait for the fio ramp_time to finish and
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats.
  Default: enabled.
- adq_enable - bool; only for TCP transport.
  Configure system modules, NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
- bpf_scripts - list of bpftrace scripts that will be attached during the
  test run. Available scripts can be found in the spdk/scripts/bpf directory.
- dsa_settings - bool. Only for TCP transport. Enable offloading of CRC32C
  calculation to DSA. You need a CPU with the Intel(R) Data Streaming
  Accelerator (DSA) engine.
- scheduler_core_limit - int, 0-100. Dynamic scheduler option: the load limit at
  which a core is considered full.
- irq_settings - dict;
  Choose how to adjust network interface IRQ settings.
  Same as in the common options section, but the SPDK Target allows more modes:
  mode: shared - align IRQs to be processed only on the same CPU cores which
  are already used by the SPDK Target process.
  mode: split - align IRQs to be processed only on CPU cores which are not
  used by the SPDK Target process.
  mode: split-bynode - same as "split", but reduce the number of CPU cores
  used for IRQ processing to only those matching the NIC NUMA node.

### Initiator system settings section

There can be one or more `initiatorX` setting sections, depending on the test setup.

``` ~sh
"initiator1": {
    "ip": "10.0.0.1",
    "nic_ips": ["192.0.1.2"],
    "target_nic_ips": ["192.0.1.1"],
    "mode": "spdk",
    "fio_bin": "/path/to/fio/bin",
    "nvmecli_bin": "/path/to/nvmecli/bin",
    "cpus_allowed": "0,1,10-15",
    "cpus_allowed_policy": "shared",
    "num_cores": 4,
    "cpu_frequency": 2100000,
    "adq_enable": false,
    "kernel_engine": "io_uring",
    "irq_settings": { "mode": "bynode" }
}
```

Required:

- ip - management IP address of the initiator system, used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For SPDK, the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvme-cli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os (see the sketch after this list).
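For reference, in "kernel" mode the connection set up by the scripts is roughly
equivalent to running nvme-cli by hand; a sketch using the example addresses from
the snippets above (transport, addresses, port and subsystem NQN are illustrative
only) could look like this:

``` ~sh
# Discover subsystems exported by the target, then connect to one of them.
# Transport, addresses, port and NQN below are examples only.
nvme discover -t rdma -a 192.0.1.1 -s 4420
nvme connect -t rdma -a 192.0.1.1 -s 4420 -n nqn.2018-09.io.spdk:cnode1
```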
Optional, common:

- nvmecli_bin - path to the nvme-cli binary; will be used for the "discovery" command
  (for both SPDK and Kernel modes) and for "connect" (in case of Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run IO.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores randomly allocated).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - By default fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at max frequencies. This option allows
  the user to select a CPU frequency instead of running at max frequency. Before
  using this option `intel_pstate=disable` must be set in the boot options and
  the cpupower governor must be set to `userspace`.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- kernel_engine - select the fio ioengine mode to run tests. io_uring libraries and
  io_uring capable fio binaries must be present on Initiator systems!
  Available options:
  - libaio (default)
  - io_uring
- irq_settings - dict;
  Same as "irq_settings" in the Target common options section.

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules, NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
- enable_data_digest - bool; only for TCP transport. Enable the data
  digest for the bdev controller. The target can use IDXD to calculate the
  data digest or fall back to a software optimized implementation on systems
  that don't have the Intel(R) Data Streaming Accelerator (DSA) engine.
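To complement the spdk-mode example above, a minimal kernel-mode initiator section
might look like the following sketch (addresses and the io_uring engine choice are
examples only):

``` ~sh
"initiator2": {
    "ip": "10.0.0.2",
    "nic_ips": ["192.0.2.2"],
    "target_nic_ips": ["192.0.2.1"],
    "mode": "kernel",
    "kernel_engine": "io_uring"
}
```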
### Fio settings section

``` ~sh
"fio": {
    "bs": ["4k", "128k"],
    "qd": [32, 128],
    "rw": ["randwrite", "write"],
    "rwmixread": 100,
    "rate_iops": 10000,
    "num_jobs": 2,
    "offset": true,
    "offset_inc": 10,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - read operations percentage in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during the ramp time
- run_num - number of times each workload combination is run.
  If more than 1, the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number
- offset - bool; enable offsetting of the IO within the file. When this option is
  enabled the file is "split" into a number of chunks equal to the "num_jobs"
  parameter value, and each of the "num_jobs" fio threads gets its own chunk to
  work with.
  For more detail see "offset", "offset_increment" and "size" in the fio man
  pages. Default: false.
- offset_inc - int; percentage value determining the offset, size and
  offset_increment when the "offset" option is enabled. By default, if "offset"
  is enabled the fio file is split evenly between the fio threads doing the
  IO. offset_inc can be used to specify a custom value.

#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In such a case the script creates a list of their combinations and runs IO tests
for all of these combinations. For example, the following configuration:

``` ~sh
  "bs": ["4k"],
  "qd": [32, 128],
  "rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio target
device ("filename" in the resulting fio configuration file). It is re-calculated
while the script is running, so the generated fio configuration file might contain
a different value than the one the user specified at input, especially when also
using the "numjobs" or initiator "num_cores" parameters. For example:

The Target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.

Initiator configuration (relevant settings only):

``` ~sh
"initiator1": {
    "num_cores": 1
}
```

Fio configuration:

``` ~sh
"fio": {
    "bs": ["4k"],
    "qd": [128],
    "rw": ["randread"],
    "rwmixread": 100,
    "num_jobs": 1,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

``` ~sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between the `filename` entries local to the job section in
round-robin (by default) fashion. In case of fio targets with the same characteristics
(IOPS & bandwidth capabilities) this means that the iodepth is distributed **roughly**
equally. Ultimately, the above fio configuration results in iodepth=128 per filename.

`numjobs` higher than 1 is also taken into account, so that the desired qd per
filename is retained:

``` ~sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```

With the exception of `run_num`, more information on these options can be found in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.
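For example, assuming an 8 GiB hugepage allocation (the HUGEMEM value is given in
megabytes and is only a suggestion; size it to your configuration, or omit it to
use the script's default):

``` ~sh
cd spdk
sudo HUGEMEM=8192 ./scripts/setup.sh
```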
Run the script on the NVMe-oF target system:

``` ~sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file in the scripts/perf/nvmf
directory. You can specify a different configuration file at runtime as shown below:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local Python
modules. If you'd like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/python`,
you need to modify your environment so that the Python interpreter is aware of
the `spdk/python` directory.

## Test Results

Test results for all workload combinations are printed to screen once the tests
are finished. Additionally, all aggregate results are saved to /tmp/results/nvmf_results.conf.
The results directory path can be changed with the -r script parameter.
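For example, to use a custom configuration file and store results in a custom
directory (both paths are placeholders):

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py \
    -c /path/to/config.json -r /path/to/results/dir
```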