# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests
that are later published in the [spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment please follow the steps below.

## Test Systems Requirements

- The OS installed on the test systems must be a Linux OS.
  The scripts were primarily used on systems with Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like the Intel E810 CQDA2 or Mellanox ConnectX-5 are
  recommended because the NVMe-oF workload is network bound.
  If the NICs on the NVMe-oF target system are capable of less than 100 Gbps,
  they will quickly become saturated.
- A Python 3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed.
- The nvme-cli package must be installed on all test systems.
- fio must be downloaded from [Github](https://github.com/axboe/fio) and built.
  This must be done on the Initiator test systems so that SPDK can later be built
  with the "--with-fio" option.
- All test systems must have a user account with a common name and password,
  and passwordless sudo must be enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.
- The `sysstat` package must be installed for SAR CPU utilization measurements.
- The `bwm-ng` package must be installed for NIC bandwidth utilization measurements.
- The `pcm` package must be installed for pcm, pcm-power and pcm-memory measurements.

### Optional

- For tests using the Kernel Target, nvmetcli must be downloaded and built on the
  Target system. nvmetcli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).

## Manual configuration

Before running the scripts some manual configuration of the test systems is required:

- Configure IP address assignment on the NIC ports that will be used for the test.
  Make sure these assignments are persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible).
- Adjust or completely disable local security engines like AppArmor or SELinux.

## JSON configuration for test run automation

An example json configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-sections describe each configuration section in more detail.

### General settings section

``` ~sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default the SPDK sources are copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run. If SPDK
  is already in place on the Initiator systems and there is no need to re-build it,
  set this option to true. Default: false.
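All of the sections described below are combined into a single JSON configuration
file. For orientation, a minimal sketch of the complete file might look like the
example below; all values are placeholders and the individual options are described
in the following sections.

``` ~sh
{
    "general": {
        "username": "user",
        "password": "password",
        "transport": "tcp"
    },
    "target": {
        "mode": "spdk",
        "nic_ips": ["192.0.1.1"],
        "core_mask": "[1-10]"
    },
    "initiator1": {
        "ip": "10.0.0.1",
        "nic_ips": ["192.0.1.2"],
        "target_nic_ips": ["192.0.1.1"],
        "mode": "spdk"
    },
    "fio": {
        "bs": ["4k"],
        "qd": [32],
        "rw": ["randread"],
        "rwmixread": 100,
        "num_jobs": 1,
        "run_time": 30,
        "ramp_time": 30,
        "run_num": 1
    }
}
```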
### Target System Configuration

``` ~sh
"target": {
    "mode": "spdk",
    "nic_ips": ["192.0.1.1", "192.0.2.1"],
    "core_mask": "[1-10]",
    "null_block_devices": 8,
    "nvmet_bin": "/path/to/nvmetcli",
    "sar_settings": true,
    "pcm_settings": false,
    "enable_bandwidth": [true, 60],
    "enable_dpdk_memory": true,
    "num_shared_buffers": 4096,
    "scheduler_settings": "static",
    "zcopy_settings": false,
    "dif_insert_strip": true,
    "null_block_dif_type": 3,
    "pm_settings": [true, 30, 1, 60]
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by the SPDK target only.
  CPU core mask, either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of a range is inclusive!

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - bool.
  Enable SAR CPU utilization measurement on the Target side. The SAR thread
  waits until fio finishes its "ramp_time" and then measures for the
  fio "run_time" duration. Default: enabled.
- pcm_settings - bool.
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include CPU, memory and power consumption. Default: enabled.
- enable_bandwidth - bool. Measure bandwidth utilization on network
  interfaces. Default: enabled.
- tuned_profile - tunedadm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- enable_pm - bool;
  if set to true, power measurement is enabled via collect-bmc-pm on
  the target side. Default: true.

Optional, Kernel Target only:

- nvmet_bin - path to the nvmetcli binary, if not available in $PATH.
  Only for the Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only:

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- max_queue_depth - int, max number of outstanding I/O per queue. Default: 128.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- null_block_dif_type - int, 0-3. Level of DIF type to use when creating
  null block bdevs. Default: 0.
- enable_dpdk_memory - bool. Wait for the fio ramp_time to finish and
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats.
  Default: enabled.
- adq_enable - bool; only for TCP transport.
  Configure system modules, NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
- bpf_scripts - list of bpftrace scripts that will be attached during the
  test run. Available scripts can be found in the spdk/scripts/bpf directory.
- dsa_settings - bool. Only for TCP transport. Enable offloading of CRC32C
  calculation to DSA. You need a CPU with the Intel(R) Data Streaming
  Accelerator (DSA) engine.
- scheduler_core_limit - int, 0-100. Dynamic scheduler option: the load limit
  above which a core is considered full.
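For comparison, a Kernel target uses only the common options. A minimal sketch of
such a target section might look like the example below; the IP addresses and the
nvmetcli path are placeholders, and core_mask is omitted because it applies to the
SPDK target only.

``` ~sh
"target": {
    "mode": "kernel",
    "nic_ips": ["192.0.1.1", "192.0.2.1"],
    "nvmet_bin": "/path/to/nvmetcli",
    "null_block_devices": 8
}
```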
### Initiator system settings section

There can be one or more `initiatorX` settings sections, depending on the test setup.

``` ~sh
"initiator1": {
    "ip": "10.0.0.1",
    "nic_ips": ["192.0.1.2"],
    "target_nic_ips": ["192.0.1.1"],
    "mode": "spdk",
    "fio_bin": "/path/to/fio/bin",
    "nvmecli_bin": "/path/to/nvmecli/bin",
    "cpus_allowed": "0,1,10-15",
    "cpus_allowed_policy": "shared",
    "num_cores": 4,
    "cpu_frequency": 2100000,
    "adq_enable": false,
    "kernel_engine": "io_uring"
}
```

Required:

- ip - management IP address of the initiator system, used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For "spdk", the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvme-cli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os.

Optional, common:

- nvmecli_bin - path to the nvme-cli binary; it will be used for the "discovery" command
  (for both SPDK and Kernel modes) and for "connect" (in case of Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run IO.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores allocated randomly).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - by default fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at maximum frequencies. This option allows
  the user to select a fixed CPU frequency instead of running at the maximum
  frequency. Before using this option `intel_pstate=disable` must be set in the
  boot options and the cpupower governor must be set to `userspace`.
- tuned_profile - tunedadm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- kernel_engine - select the fio ioengine used to run the tests. io_uring libraries
  and an io_uring-capable fio binary must be present on the Initiator systems!
  Available options:
  - libaio (default)
  - io_uring

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules, NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
- enable_data_digest - bool; only for TCP transport. Enable the data
  digest for the bdev controller. The target can use IDXD to calculate the
  data digest or fall back to a software-optimized implementation on systems
  that don't have the Intel(R) Data Streaming Accelerator (DSA) engine.
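For comparison, a kernel-mode initiator section that connects with nvme-cli and
submits IO through io_uring might look like the sketch below; the addresses and
binary paths are placeholders.

``` ~sh
"initiator1": {
    "ip": "10.0.0.1",
    "nic_ips": ["192.0.1.2"],
    "target_nic_ips": ["192.0.1.1"],
    "mode": "kernel",
    "nvmecli_bin": "/path/to/nvme",
    "fio_bin": "/path/to/fio/bin",
    "kernel_engine": "io_uring"
}
```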
### Fio settings section

``` ~sh
"fio": {
    "bs": ["4k", "128k"],
    "qd": [32, 128],
    "rw": ["randwrite", "write"],
    "rwmixread": 100,
    "rate_iops": 10000,
    "num_jobs": 2,
    "offset": true,
    "offset_inc": 10,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - percentage of read operations in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by the initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during this period
- run_num - number of times each workload combination is run.
  If more than 1, the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number
- offset - bool; enable offsetting of the IO within the file. When this option is
  enabled the file is "split" into a number of chunks equal to the "num_jobs"
  parameter value, and each of the "num_jobs" fio threads gets its own chunk to
  work with.
  For more detail see "offset", "offset_increment" and "size" in the fio man
  pages. Default: false.
- offset_inc - int; percentage value determining the offset, size and
  offset_increment when the "offset" option is enabled. By default, if "offset"
  is enabled, the fio file gets split evenly between the fio threads doing the
  IO. offset_inc can be used to specify a custom value.
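As an illustration of how the optional knobs combine, a 70/30 mixed random
read/write workload throttled with rate_iops and with per-job offsets enabled
might be described as in the sketch below; all values are illustrative only.

``` ~sh
"fio": {
    "bs": ["4k"],
    "qd": [64],
    "rw": ["randrw"],
    "rwmixread": 70,
    "rate_iops": 10000,
    "num_jobs": 2,
    "offset": true,
    "offset_inc": 10,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 3
}
```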
#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In that case the script creates a list of all their combinations and runs IO tests
for each of these combinations. For example, the following configuration:

``` ~sh
"bs": ["4k"],
"qd": [32, 128],
"rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about the queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio target
device ("filename" in the resulting fio configuration file). It is re-calculated
while the script is running, so the generated fio configuration file might contain
a different value than what the user specified at input, especially when also
using the "numjobs" or initiator "num_cores" parameters. For example:

The Target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.

Initiator configuration (relevant settings only):

``` ~sh
"initiator1": {
    "num_cores": 1
}
```

Fio configuration:

``` ~sh
"fio": {
    "bs": ["4k"],
    "qd": [128],
    "rw": ["randread"],
    "rwmixread": 100,
    "num_jobs": 1,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

``` ~sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between the `filename` entries of that job section in a
round-robin fashion (by default). For fio targets with the same characteristics
(IOPS and bandwidth capabilities) this means that the iodepth is distributed
**roughly** equally. Ultimately, the above fio configuration results in
iodepth=128 per filename (512 / 4 filenames = 128).

A `numjobs` value higher than 1 is also taken into account, so that the desired qd
per filename is retained:

``` ~sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```

(Here 2 jobs × 256 / 4 filenames again gives iodepth=128 per filename.)

With the exception of `run_num`, more information on these options can be found
in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the NVMe devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.

Run the script on the NVMe-oF target system:

``` ~sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file located in the
scripts/perf/nvmf directory. You can specify a different configuration file at
runtime as shown below:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local
Python modules. If you'd like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/python`
you need to modify your environment so that the Python interpreter is aware of
the `spdk/python` directory.

## Test Results

Test results for all workload combinations are printed to the screen once the tests
are finished. Additionally, all aggregate results are saved to
/tmp/results/nvmf_results.conf. The results directory path can be changed with the
-r script parameter.
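For example, to run with a custom configuration file and store the results in a
custom directory:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json -r /path/to/results/dir
```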