# Running NVMe-oF Performance Test Cases

Scripts contained in this directory are used to run TCP and RDMA benchmark tests,
the results of which are later published in the
[spdk.io performance reports section](https://spdk.io/doc/performance_reports.html).
To run the scripts in your environment, please follow the steps below.

## Test Systems Requirements

- The OS installed on the test systems must be a Linux OS.
  The scripts were primarily used on systems with Fedora and
  Ubuntu 18.04/20.04 distributions.
- Each test system must have at least one RDMA-capable NIC installed for RDMA tests.
  For TCP tests any TCP-capable NIC will do. However, high-bandwidth,
  high-performance NICs like the Intel E810 CQDA2 or Mellanox ConnectX-5 are
  recommended because the NVMe-oF workload is network bound.
  A NIC capable of less than 100 Gbps on the NVMe-oF target system
  will quickly become saturated and limit the results.
- A Python3 interpreter must be available on all test systems.
  The Paramiko and Pandas modules must be installed.
- The nvme-cli package must be installed on all test systems.
- fio must be downloaded from [GitHub](https://github.com/axboe/fio) and built.
  This must be done on the Initiator test systems so that SPDK can later be
  built with the "--with-fio" option.
- All test systems must have a user account with a common name and
  password, with passwordless sudo enabled.
- The [mlnx-tools](https://github.com/Mellanox/mlnx-tools) package must be downloaded
  to the /usr/src/local directory in order to configure NIC port IRQ affinity.
  If a custom directory is to be used, it must be set using the irq_scripts_dir
  option in the Target and Initiator configuration sections.

### Optional

- For tests using the Kernel Target, nvmetcli must be downloaded and built on the
  Target system. nvmetcli is available [here](http://git.infradead.org/users/hch/nvmetcli.git).

## Manual configuration

Before running the scripts, some manual configuration of the test systems is required:

- Configure IP address assignment on the NIC ports that will be used for the test.
  Make sure these assignments are persistent, as in some cases NIC drivers may be reloaded.
- Adjust the firewall service to allow traffic on the IP/port pairs used in the test
  (or disable the firewall service completely if possible).
- Adjust or completely disable local security engines like AppArmor or SELinux.

## JSON configuration for test run automation

An example JSON configuration file with the minimum configuration required
to automate NVMe-oF testing is provided in this repository.
The following sub-chapters describe each configuration section in more detail.

### General settings section

``` ~sh
"general": {
    "username": "user",
    "password": "password",
    "transport": "transport_type",
    "skip_spdk_install": bool
}
```

Required:

- username - username for the SSH session
- password - password for the SSH session
- transport - transport layer to be used throughout the test ("tcp" or "rdma")

Optional:

- skip_spdk_install - by default SPDK sources are copied from the Target
  to the Initiator systems each time the run_nvmf.py script is run. If SPDK
  is already in place on the Initiator systems and there is no need to re-build it,
  set this option to true.
  Default: false.
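
Because every remote operation uses this single account, it is worth verifying up
front that the credentials work and that passwordless sudo is enabled on each test
system. A minimal check, assuming a placeholder management IP of 10.0.0.1:

``` ~sh
# Placeholder address; repeat for each Initiator/Target system.
ssh user@10.0.0.1 'sudo -n true && echo "passwordless sudo OK"'
```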
### Target System Configuration

``` ~sh
"target": {
    "mode": "spdk",
    "nic_ips": ["192.0.1.1", "192.0.2.1"],
    "core_mask": "[1-10]",
    "null_block_devices": 8,
    "nvmet_bin": "/path/to/nvmetcli",
    "sar_settings": [true, 30, 1, 60],
    "pcm_settings": ["/tmp/pcm", 30, 1, 60],
    "enable_bandwidth": [true, 60],
    "enable_dpdk_memory": [true, 30],
    "num_shared_buffers": 4096,
    "scheduler_settings": "static",
    "zcopy_settings": false,
    "dif_insert_strip": true,
    "null_block_dif_type": 3
}
```

Required:

- mode - Target application mode, "spdk" or "kernel".
- nic_ips - IP addresses of NIC ports to be used by the target to export
  NVMe-oF subsystems.
- core_mask - Used by the SPDK target only.
  CPU core mask, either in the form of an actual mask (e.g. 0xAAAA) or a core list
  (e.g. [0,1,2-5,6]).
  At this moment the scripts cannot restrict the Kernel target to only
  use certain CPU cores. Important: the upper bound of a range is inclusive!

Optional, common:

- null_block_devices - int, number of null block devices to create.
  Detected NVMe devices are not used if this option is present. Default: 0.
- sar_settings - [bool, int(x), int(y), int(z)];
  Enable SAR CPU utilization measurement on the Target side.
  Wait for "x" seconds before starting measurements, then take "z" samples
  with "y"-second intervals between them. Default: disabled.
- pcm_settings - [path, int(x), int(y), int(z)];
  Enable [PCM](https://github.com/opcm/pcm.git) measurements on the Target side.
  Measurements include CPU, memory and power consumption. "path" points to a
  directory where the pcm executables are present. Default: disabled.
- enable_bandwidth - [bool, int]. Wait a given number of seconds and run
  bwm-ng until the end of the test to measure bandwidth utilization on network
  interfaces. Default: disabled.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts

Optional, Kernel Target only:

- nvmet_bin - path to the nvmetcli binary, if not available in $PATH.
  Only for the Kernel Target. Default: "nvmetcli".

Optional, SPDK Target only:

- zcopy_settings - bool. Disable or enable the target-side zero-copy option.
  Default: false.
- scheduler_settings - str. Select the SPDK Target thread scheduler (static/dynamic).
  Default: static.
- num_shared_buffers - int, number of shared buffers to allocate when
  creating the transport layer. Default: 4096.
- dif_insert_strip - bool. Only for TCP transport. Enable the DIF option when
  creating the transport layer. Default: false.
- null_block_dif_type - int, 0-3. Level of DIF type to use when creating
  null block bdevs. Default: 0.
- enable_dpdk_memory - [bool, int]. Wait a given number of seconds, then
  call the env_dpdk_get_mem_stats RPC to dump DPDK memory stats. Typically
  the wait time should be at least the fio ramp_time described in another section.
- adq_enable - bool; only for TCP transport.
  Configure system modules and NIC settings and create priority traffic classes
  for ADQ testing. You need an ADQ-capable NIC like the Intel E810.
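
For the SPDK Target, several of these options end up as arguments of RPC calls
issued by the script when it creates the transport layer. A rough sketch of the
equivalent manual call (illustrative only; run_nvmf.py issues it for you, and the
exact flags here are assumed from scripts/rpc.py):

``` ~sh
# Sketch of the transport-creation RPC for an SPDK target; values taken
# from the example configuration above, not something you need to run.
sudo ./scripts/rpc.py nvmf_create_transport --trtype TCP --num-shared-buffers 4096
```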
### Initiator system settings section

There can be one or more `initiatorX` setting sections, depending on the test setup.

``` ~sh
"initiator1": {
    "ip": "10.0.0.1",
    "nic_ips": ["192.0.1.2"],
    "target_nic_ips": ["192.0.1.1"],
    "mode": "spdk",
    "fio_bin": "/path/to/fio/bin",
    "nvmecli_bin": "/path/to/nvmecli/bin",
    "cpus_allowed": "0,1,10-15",
    "cpus_allowed_policy": "shared",
    "num_cores": 4,
    "cpu_frequency": 2100000,
    "adq_enable": false
}
```

Required:

- ip - management IP address of the initiator system, used to set up the SSH connection.
- nic_ips - list of IP addresses of NIC ports to be used in the test,
  local to the given initiator system.
- target_nic_ips - list of IP addresses of Target NIC ports to which the initiator
  will attempt to connect.
- mode - initiator mode, "spdk" or "kernel". For "spdk", the bdev fio plugin
  will be used to connect to NVMe-oF subsystems and submit I/O. For "kernel",
  nvme-cli will be used to connect to NVMe-oF subsystems and fio will use the
  libaio ioengine to submit I/Os.

Optional, common:

- nvmecli_bin - path to the nvme-cli binary; will be used for the "discovery" command
  (in both SPDK and Kernel modes) and for "connect" (in Kernel mode).
  Default: system-wide "nvme".
- fio_bin - path to a custom fio binary, which will be used to run IO.
  Additionally, the directory where the binary is located should also contain
  the fio sources needed to build the SPDK fio_plugin for spdk initiator mode.
  Default: /usr/src/fio/fio.
- cpus_allowed - str, list of CPU cores to run fio threads on. Takes precedence
  over the `num_cores` setting. Default: None (CPU cores randomly allocated).
  For more information see `man fio`.
- cpus_allowed_policy - str, "shared" or "split". CPU sharing policy for fio
  threads. Default: shared. For more information see `man fio`.
- num_cores - by default fio threads on the initiator side will use as many CPUs
  as there are connected subsystems. This option limits the number of CPU cores
  used for fio threads to this number; cores are allocated randomly and fio
  `filename` parameters are grouped if needed. The `cpus_allowed` option takes
  precedence and `num_cores` is ignored if both are present in the config.
- cpu_frequency - int, custom CPU frequency to set. By default test setups are
  configured to run in performance mode at maximum frequency. This option allows
  the user to select a CPU frequency instead of running at maximum frequency.
  Before using this option, `intel_pstate=disable` must be set in the boot options
  and the cpupower governor must be set to `userspace`.
- tuned_profile - tuned-adm profile to apply on the system before starting
  the test.
- irq_scripts_dir - path to the scripts directory of the Mellanox mlnx-tools package;
  used to run the set_irq_affinity.sh script.
  Default: /usr/src/local/mlnx-tools/ofed_scripts

Optional, SPDK Initiator only:

- adq_enable - bool; only for TCP transport. Configure system modules and NIC
  settings and create priority traffic classes for ADQ testing.
  You need an ADQ-capable NIC like the Intel E810.
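
For reference, in kernel initiator mode the connection step boils down to the
standard nvme-cli discovery/connect sequence. A sketch with placeholder transport,
address, port and subsystem NQN:

``` ~sh
# Placeholder transport, target address, port and NQN; the script performs
# the equivalent steps automatically in kernel initiator mode.
nvme discover -t tcp -a 192.0.1.1 -s 4420
nvme connect -t tcp -a 192.0.1.1 -s 4420 -n nqn.2018-09.io.spdk:cnode1
```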
### Fio settings section

``` ~sh
"fio": {
    "bs": ["4k", "128k"],
    "qd": [32, 128],
    "rw": ["randwrite", "write"],
    "rwmixread": 100,
    "rate_iops": 10000,
    "num_jobs": 2,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 3
}
```

Required:

- bs - fio IO block size
- qd - fio iodepth
- rw - fio rw mode
- rwmixread - percentage of read operations in case of mixed workloads
- num_jobs - fio numjobs parameter
  Note: may affect the total number of CPU cores used by the initiator systems
- run_time - fio run time
- ramp_time - fio ramp time; no measurements are taken during this period
- run_num - number of times each workload combination is run.
  If more than 1, the final result is the average of all runs.

Optional:

- rate_iops - limit IOPS to this number

#### Test Combinations

It is possible to specify more than one value for the bs, qd and rw parameters.
In that case the script creates a list of their combinations and runs IO tests
for all of these combinations. For example, the following configuration:

``` ~sh
  "bs": ["4k"],
  "qd": [32, 128],
  "rw": ["write", "read"]
```

results in the following workloads being tested:

- 4k-write-32
- 4k-write-128
- 4k-read-32
- 4k-read-128

#### Important note about queue depth parameter

qd in the fio settings section refers to the iodepth generated per single fio
target device ("filename" in the resulting fio configuration file). It is
re-calculated while the script is running, so the generated fio configuration
file might contain a different value than the one specified by the user at input,
especially when also using the "numjobs" or initiator "num_cores" parameters.
For example:

The Target system exposes 4 NVMe-oF subsystems. One initiator system connects to
all of these subsystems.

Initiator configuration (relevant settings only):

``` ~sh
"initiator1": {
    "num_cores": 1
}
```

Fio configuration:

``` ~sh
"fio": {
    "bs": ["4k"],
    "qd": [128],
    "rw": ["randread"],
    "rwmixread": 100,
    "num_jobs": 1,
    "run_time": 30,
    "ramp_time": 30,
    "run_num": 1
}
```

In this case the generated fio configuration will look like this
(relevant settings only):

``` ~sh
[global]
numjobs=1

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=512
```

The `num_cores` option results in the 4 connected subsystems being grouped under a
single fio thread (job_section0). Because `iodepth` is local to `job_section0`,
it is distributed between the `filename` entries of the job section in a round-robin
(by default) fashion. For fio targets with the same characteristics
(IOPS & bandwidth capabilities) this means that the iodepth is distributed **roughly**
equally. Ultimately, the above fio configuration results in iodepth=128 per filename.

`numjobs` higher than 1 is also taken into account, so that the desired qd per
filename is retained:

``` ~sh
[global]
numjobs=2

[job_section0]
filename=Nvme0n1
filename=Nvme1n1
filename=Nvme2n1
filename=Nvme3n1
iodepth=256
```

Here each of the two job clones distributes iodepth=256 over the 4 filenames
(64 per filename per clone), which again adds up to 128 outstanding IOs per filename.

With the exception of `run_num`, more information on these options can be found
in `man fio`.

## Running the test

Before running the test script, run the spdk/scripts/setup.sh script on the Target
system. This binds the devices to the VFIO/UIO userspace driver and allocates
hugepages for the SPDK process.
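
For example (the hugepage amount below is only an illustration; adjust it to your
system and workload):

``` ~sh
cd spdk
# HUGEMEM is given in megabytes; 8192 is an example value, not a requirement.
sudo HUGEMEM=8192 ./scripts/setup.sh
```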
Run the script on the NVMe-oF target system:

``` ~sh
cd spdk
sudo PYTHONPATH=$PYTHONPATH:$PWD/scripts scripts/perf/nvmf/run_nvmf.py
```

By default the script uses the config.json configuration file in the scripts/perf/nvmf
directory. You can specify a different configuration file at runtime as shown below:

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/scripts scripts/perf/nvmf/run_nvmf.py -c /path/to/config.json
```

The PYTHONPATH environment variable is needed because the script uses SPDK-local
Python modules. If you would like to get rid of `PYTHONPATH=$PYTHONPATH:$PWD/scripts`,
modify your environment so that the Python interpreter is aware of the
`spdk/scripts` directory.

## Test Results

Test results for all workload combinations are printed to the screen once the tests
are finished. Additionally, all aggregate results are saved to /tmp/results/nvmf_results.conf.
The results directory path can be changed with the -r script parameter.
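
For example, to store the results in a custom directory (the path below is just a
placeholder):

``` ~sh
sudo PYTHONPATH=$PYTHONPATH:$PWD/scripts scripts/perf/nvmf/run_nvmf.py -r /path/to/results/dir
```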