# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests,
which are later published in the [spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment, please follow the steps below.

## Test Systems Requirements

- The OS installed on the test systems must be a Linux OS.
  The scripts were primarily used on systems with Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like the Intel E810 CQDA2 or Mellanox ConnectX-5 are
  recommended because the NVMe-oF workload is network-bound.
  If you use a NIC capable of less than 100 Gbps on the NVMe-oF target
  system, you will quickly saturate your NICs.
- A Python3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed.
- The nvme-cli package must be installed on all test systems.
- fio must be downloaded from [GitHub](https://github.com/axboe/fio) and built.
  This must be done on the Initiator test systems so that SPDK can later be
  built with the "--with-fio" option.
- All test systems must have a user account with a common name and
  password, and passwordless sudo must be enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.

### Optional

- For tests using the Kernel Target, nvmet-cli must be downloaded and built on the Target system.
  nvmet-cli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).

## Manual configuration

Before running the scripts, some manual configuration of the test systems is required:

- Configure IP address assignment on the NIC ports that will be used for the test.
  Make sure these assignments are persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible).
- Adjust or completely disable local security engines like AppArmor or SELinux.

## JSON configuration for test run automation

An example JSON configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-chapters describe each configuration section in more detail.

### General settings section

```sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default, SPDK sources are copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run. If SPDK
  is already in place on the Initiator systems and there is no need to re-build it,
  set this option to true.
  Default: false.
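As an illustration of how these fields are consumed, the sketch below (not taken
from run_nvmf.py; the config path, management IP and command are placeholders)
loads the general section and opens a paramiko SSH session, which is roughly how
the automation connects to each remote test system:

```python
import json

import paramiko

# Placeholder config path and management IP; adjust to your setup.
with open("scripts/perf/nvmf/config.json") as f:
    config = json.load(f)
general = config["general"]

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("10.0.0.1",  # management IP of a test system
               username=general["username"],
               password=general["password"])
_, stdout, _ = client.exec_command("uname -r")
print(stdout.read().decode().strip())
client.close()
```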
### Target System Configuration

```sh
"target": {
    "mode": "spdk",
    "nic_ips": ["192.0.1.1", "192.0.2.1"],
    "core_mask": "[1-10]",
    "null_block_devices": 8,
    "nvmet_bin": "/path/to/nvmetcli",
    "sar_settings": [true, 30, 1, 60],
    "pcm_settings": ["/tmp/pcm", 30, 1, 60],
    "enable_bandwidth": [true, 60],
    "enable_dpdk_memory": [true, 30],
    "num_shared_buffers": 4096,
    "scheduler_settings": "static",
    "zcopy_settings": false,
    "dif_insert_strip": true,
    "null_block_dif_type": 3,
    "enable_pm": true
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by the SPDK target only.
  CPU core mask, either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of a range is inclusive!

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - [bool, int(x), int(y), int(z)];
  Enable SAR CPU utilization measurement on the Target side.
  Wait "x" seconds before starting measurements, then take "z" samples
  at "y"-second intervals. Default: disabled.
  See the timing sketch at the end of this section.
- pcm_settings - [path, int(x), int(y), int(z)];
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include CPU, memory and power consumption. "path" points to a
  directory where the pcm executables are present.
  "x" - time to wait before starting measurements (suggested to be equal to the
  fio ramp_time).
  "y" - time interval between measurements.
  "z" - number of measurement samples.
  Default: disabled.
- enable_bandwidth - [bool, int]. Wait a given number of seconds and run
  bwm-ng until the end of the test to measure bandwidth utilization on network
  interfaces. Default: disabled.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- enable_pm - if set to true, power measurement is enabled via collect-bmc-pm
  on the target side.

Optional, Kernel Target only:

- nvmet_bin - path to the nvmetcli binary, if not available in $PATH.
  Only for the Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only:

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- max_queue_depth - int, max number of outstanding I/O per queue. Default: 128.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- null_block_dif_type - int, 0-3. DIF type to use when creating the
  null block bdev. Default: 0.
- enable_dpdk_memory - [bool, int]. Wait a given number of seconds and
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats. Typically
  the wait time should be at least the fio ramp_time described in another section.
- adq_enable - bool; only for TCP transport.
  Configure system modules, NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
- bpf_scripts - list of bpftrace scripts that will be attached during the
  test run. Available scripts can be found in the spdk/scripts/bpf directory.
- dsa_settings - bool. Only for TCP transport. Enable offloading of CRC32C
  calculation to DSA. You need a CPU with the Intel(R) Data Streaming
  Accelerator (DSA) engine.
- scheduler_core_limit - int, 0-100. Dynamic scheduler option setting the load
  limit at which a core is considered full.
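To make the [x, y, z] timing semantics of sar_settings (and the similar
pcm_settings) concrete, here is a minimal sketch; it is not the scripts' actual
measurement code, and the output path is just a placeholder:

```python
import subprocess
import time

# Illustration of sar_settings = [true, 30, 1, 60] from the example above:
# wait 30 s (ideally equal to the fio ramp_time), then take 60 per-CPU
# utilization samples at 1-second intervals.
enable, wait_time, interval, count = True, 30, 1, 60
if enable:
    time.sleep(wait_time)
    with open("/tmp/sar_target_cpu.txt", "w") as out:  # placeholder output file
        subprocess.run(["sar", "-P", "ALL", str(interval), str(count)], stdout=out)
```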
### Initiator system settings section

There can be one or more `initiatorX` setting sections, depending on the test setup.

```sh
"initiator1": {
    "ip": "10.0.0.1",
    "nic_ips": ["192.0.1.2"],
    "target_nic_ips": ["192.0.1.1"],
    "mode": "spdk",
    "fio_bin": "/path/to/fio/bin",
    "nvmecli_bin": "/path/to/nvmecli/bin",
    "cpus_allowed": "0,1,10-15",
    "cpus_allowed_policy": "shared",
    "num_cores": 4,
    "cpu_frequency": 2100000,
    "adq_enable": false,
    "kernel_engine": "io_uring"
}
```

Required:

- ip - management IP address of the initiator system, used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For "spdk", the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvmecli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os.

Optional, common:

- nvmecli_bin - path to the nvmecli binary; will be used for the "discovery" command
  (for both SPDK and Kernel modes) and for "connect" (in case of Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run I/O.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for the spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores allocated randomly).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - by default, fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at maximum frequencies. This option allows
  the user to select a specific CPU frequency instead of running at the maximum
  frequency. Before using this option, `intel_pstate=disable` must be set in the
  boot options and the cpupower governor must be set to `userspace`.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts
- kernel_engine - select the fio ioengine mode to run tests. io_uring libraries and
  an io_uring capable fio binary must be present on the Initiator systems!
  See the sketch at the end of this section.
  Available options:
  - libaio (default)
  - io_uring

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules, NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
- enable_data_digest - bool; only for TCP transport. Enable the data
  digest for the bdev controller. The target can use IDXD to calculate the
  data digest or fall back to a software-optimized implementation on systems
  that do not have the Intel(R) Data Streaming Accelerator (DSA) engine.
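For reference, a simplified sketch (not the scripts' actual code) of how the
`mode` and `kernel_engine` options relate to the fio ioengine that ends up in
the generated job file; `spdk_bdev` is the ioengine name provided by SPDK's
fio bdev plugin:

```python
# Simplified illustration of the mode / kernel_engine relationship described
# above; the real scripts also handle plugin paths, LD_PRELOAD, etc.
def pick_ioengine(mode: str, kernel_engine: str = "libaio") -> str:
    if mode == "spdk":
        return "spdk_bdev"        # SPDK bdev fio plugin
    return kernel_engine          # "libaio" (default) or "io_uring"


print(pick_ioengine("spdk"))                  # spdk_bdev
print(pick_ioengine("kernel", "io_uring"))    # io_uring
```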
### Fio settings section

```sh
"fio": {
    "bs": ["4k", "128k"],
    "qd": [32, 128],
    "rw": ["randwrite", "write"],
    "rwmixread": 100,
    "rate_iops": 10000,
    "num_jobs": 2,
    "offset": true,
    "offset_inc": 10,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - read operations percentage in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by the initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during this period
- run_num - number of times each workload combination is run.
  If more than 1, the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number
- offset - bool; enable offsetting of the IO into the file. When this option is
  enabled the file is "split" into a number of chunks equal to the "num_jobs"
  parameter value, and each of the "num_jobs" fio threads gets its own chunk to
  work with.
  For more detail see "offset", "offset_increment" and "size" in the fio man
  pages. Default: false.
- offset_inc - int; percentage value determining the offset, size and
  offset_increment when the "offset" option is enabled. By default, if "offset"
  is enabled the fio file is split evenly between the fio threads doing the
  IO. offset_inc can be used to specify a custom value.

#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In such a case the script creates a list of their combinations and runs IO tests
for all of these combinations. For example, the following configuration:

```sh
  "bs": ["4k"],
  "qd": [32, 128],
  "rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio
target device ("filename" in the resulting fio configuration file). It is
re-calculated while the script is running, so the generated fio configuration
file might contain a different value than what the user specified at input,
especially when also using the "numjobs" or initiator "num_cores" parameters.
For example:

The Target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.
Initiator configuration (relevant settings only):

```sh
"initiator1": {
    "num_cores": 1
}
```

Fio configuration:

```sh
"fio": {
    "bs": ["4k"],
    "qd": [128],
    "rw": ["randread"],
    "rwmixread": 100,
    "num_jobs": 1,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

```sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between the `filename` entries in that job section in a
round-robin fashion (by default). For fio targets with the same characteristics
(IOPS & bandwidth capabilities) this means that the iodepth is distributed
**roughly** equally. Ultimately, the above fio configuration results in
iodepth=128 per filename.

A `numjobs` value higher than 1 is also taken into account, so that the desired
qd per filename is retained:

```sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```

With the exception of `run_num`, more information on these options can be found in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.

Run the script on the NVMe-oF target system:

```sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file in the scripts/perf/nvmf
directory. You can specify a different configuration file at runtime as shown below:

```sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/python scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local
Python modules. If you would like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/python`,
you need to modify your environment so that the Python interpreter is aware of
the `spdk/python` directory.

## Test Results

Test results for all workload combinations are printed to screen once the tests
are finished. Additionally, all aggregate results are saved to
/tmp/results/nvmf_results.conf. The results directory path can be changed with
the -r script parameter.
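Pandas is listed among the required Python modules; if the aggregate results file
is a delimited text file (an assumption here; check the format your SPDK version
actually produces), it can be loaded for further processing with something like:

```python
import pandas as pd

# Path and CSV format are assumptions; adjust to whatever your run produced
# (use the -r parameter to change the results directory).
results = pd.read_csv("/tmp/results/nvmf_results.conf")
print(results.head())
```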