# Block Device User Guide {#bdev}

# Introduction {#bdev_ug_introduction}

The SPDK block device layer, often simply called *bdev*, is a C library
intended to be equivalent to the operating system block storage layer that
often sits immediately above the device drivers in a traditional kernel
storage stack. Specifically, this library provides the following
functionality:

* A pluggable module API for implementing block devices that interface with different types of block storage devices.
* Driver modules for NVMe, malloc (ramdisk), Linux AIO, virtio-scsi, Ceph RBD, Pmem, Vhost-SCSI Initiator, and more.
* An application API for enumerating and claiming SPDK block devices and then performing operations (read, write, unmap, etc.) on those devices.
* Facilities to stack block devices to create complex I/O pipelines, including logical volume management (lvol) and partition support (GPT).
* Configuration of block devices via JSON-RPC.
* Request queueing, timeout, and reset handling.
* Multiple, lockless queues for sending I/O to block devices.

The bdev module creates an abstraction layer that provides a common API for all devices.
Users can use the available bdev modules or create their own module with any type of
device underneath (please refer to @ref bdev_module for details). SPDK also
provides vbdev modules, which create block devices on top of an existing bdev, for
example @ref bdev_ug_logical_volumes or @ref bdev_ug_gpt.

# Prerequisites {#bdev_ug_prerequisites}

This guide assumes that you can already build the standard SPDK distribution
on your platform. The block device layer is a C library with a single public
header file named bdev.h. All SPDK configuration described in the following
chapters is done using JSON-RPC commands. SPDK provides a Python-based
command line tool for sending RPC commands, located at `scripts/rpc.py`. Users
can list the available commands by running this script with the `-h` or `--help` flag.
Additionally, users can retrieve the currently supported set of RPC commands
directly from an SPDK application by running `scripts/rpc.py rpc_get_methods`.
Detailed help for each command can be displayed by adding the `-h` flag as a
command parameter.

# General Purpose RPCs {#bdev_ug_general_rpcs}

## get_bdevs {#bdev_ug_get_bdevs}

A list of currently available block devices, including detailed information about
them, can be retrieved using the `get_bdevs` RPC command. Users can add the optional
parameter `name` to get details about the bdev specified by that name.

Example response

~~~
{
  "num_blocks": 32768,
  "assigned_rate_limits": {
    "rw_ios_per_sec": 10000,
    "rw_mbytes_per_sec": 20
  },
  "supported_io_types": {
    "reset": true,
    "nvme_admin": false,
    "unmap": true,
    "read": true,
    "write_zeroes": true,
    "write": true,
    "flush": true,
    "nvme_io": false
  },
  "driver_specific": {},
  "claimed": false,
  "block_size": 4096,
  "product_name": "Malloc disk",
  "name": "Malloc0"
}
~~~

## set_bdev_qos_limit {#set_bdev_qos_limit}

Users can use the `set_bdev_qos_limit` RPC command to enable, adjust, and disable
rate limits on an existing bdev. Two types of rate limits are supported:
IOPS and bandwidth. The rate limits can be enabled, adjusted, and disabled at any
time for the specified bdev. The bdev name is a required parameter for this
RPC command, and at least one of `rw_ios_per_sec` and `rw_mbytes_per_sec` must be
specified. When both rate limits are enabled, the first limit met will
take effect. The value 0 may be specified to disable the corresponding rate
limit. Users can run this command with `-h` or `--help` for more information.

## Histograms {#rpc_bdev_histogram}

The `enable_bdev_histogram` RPC command allows enabling or disabling the gathering of
latency data for a specified bdev.
The histogram can be downloaded by the user by
calling `get_bdev_histogram` and parsed using the `scripts/histogram.py` script.

Example command

`rpc.py enable_bdev_histogram Nvme0n1 --enable`

The command will enable gathering histogram data on the Nvme0n1 device.

`rpc.py get_bdev_histogram Nvme0n1 | histogram.py`

The command will download the gathered histogram data. The script will parse
the data and show a table containing the I/O count for each latency range.

`rpc.py enable_bdev_histogram Nvme0n1 --disable`

The command will disable the histogram on the Nvme0n1 device.

# Ceph RBD {#bdev_config_rbd}

The SPDK RBD bdev driver provides SPDK block layer access to Ceph RADOS block
devices (RBD). Ceph RBD devices are accessed via the librbd and librados libraries,
which access the RADOS block device exported by Ceph. To create a Ceph bdev, the RPC
command `construct_rbd_bdev` should be used.

Example command

`rpc.py construct_rbd_bdev rbd foo 512`

This command will create a bdev that represents the 'foo' image from a pool called 'rbd'.

To remove a block device representation use the `delete_rbd_bdev` command.

`rpc.py delete_rbd_bdev Rbd0`

# Compression Virtual Bdev Module {#bdev_config_compress}

The compression bdev module can be configured to provide compression/decompression
services for an underlying thinly provisioned logical volume. Although the underlying
module can be anything (e.g. an NVMe bdev), the overall compression benefits will not be realized
unless the data stored on disk is placed appropriately. The compression vbdev module
relies on an internal SPDK library called `reduce` to accomplish this; see @ref reduce
for detailed information.

The vbdev module relies on the DPDK CompressDev Framework to provide all compression
functionality. The framework provides support for many different software-only
compression modules as well as hardware-assisted support for Intel QAT.
At this
time the vbdev module supports the DPDK drivers for ISAL and QAT.

Persistent memory is used to store metadata associated with the layout of the data on the
backing device. SPDK relies on [PMDK](http://pmem.io/pmdk/) to interface with persistent memory, so any hardware
supported by PMDK should work. If the directory for PMEM supplied upon vbdev creation does
not point to persistent memory (i.e. a regular filesystem), performance will be severely
impacted. The vbdev module and reduce libraries were designed to use persistent memory for
any production use.

Example command

`rpc.py bdev_compress_create -p /pmem_files -b myLvol`

In this example, a compression vbdev is created using persistent memory that is mapped to
the directory `pmem_files` on top of the existing thinly provisioned logical volume `myLvol`.
The resulting compression bdev will be named `COMP_LVS/myLvol`, where LVS is the name of the
logical volume store that `myLvol` resides on.

The logical volume is referred to as the backing device, and once the compression vbdev is
created it cannot be separated from the persistent memory file that will be created in
the specified directory. If the persistent memory file is not available, the compression
vbdev will also not be available.

By default the vbdev module will choose the QAT driver if the hardware and drivers are
available and loaded. If not, it will revert to the software-only ISAL driver. The driver
may be specified with the following command; however, this setting is not persistent, so it
must be applied either upon creation or before the underlying logical volume is loaded in
order to be honored. In the example below, `0` tells the vbdev module to use QAT if available
and otherwise use ISAL; this is the default, and if it is sufficient the command is not required.
Passing
a value of 1 tells the module to use QAT, and if QAT is not available, creating or loading
the vbdev will fail. A value of `2`, as shown below, tells the module
to use ISAL, and if for some reason it is not available, the vbdev will likewise fail to create or load.

`rpc.py set_compress_pmd -p 2`

To remove a compression vbdev, use the following command, which will also delete the PMEM
file. If the logical volume is deleted, the PMEM file will not be removed and the
compression vbdev will not be available.

`rpc.py bdev_compress_delete COMP_LVS/myLvol`

To list compression volumes that are only available for deletion because their PMEM file
was missing, use the following command. The name parameter is optional; if not included, the command
will list all such volumes, and if used it will return the name or an error that the device does not exist.

`rpc.py bdev_compress_get_orphans --name COMP_Nvme0n1`

# Crypto Virtual Bdev Module {#bdev_config_crypto}

The crypto virtual bdev module can be configured to provide at-rest data encryption
for any underlying bdev. The module relies on the DPDK CryptoDev Framework to provide
all cryptographic functionality. The framework provides support for many different software-only
cryptographic modules as well as hardware-assisted support for the Intel QAT board. The
framework also provides support for cipher, hash, authentication and AEAD functions. At this
time the SPDK virtual bdev module supports cipher only as follows:

- AESN-NI Multi Buffer Crypto Poll Mode Driver: RTE_CRYPTO_CIPHER_AES128_CBC
- Intel(R) QuickAssist (QAT) Crypto Poll Mode Driver: RTE_CRYPTO_CIPHER_AES128_CBC
  (Note: QAT is functional, however it is marked as experimental until the hardware has
  been fully integrated with the SPDK CI system.)
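Both drivers above implement AES128-CBC, which implies a 128-bit (16-byte) key such as the one passed to `bdev_crypto_create` later in this section. The following is an illustrative sketch (the helper is not part of SPDK) of that key-length requirement:

```python
# Illustrative helper (not part of the SPDK API): check that a key string
# has the 128-bit (16-byte) length required by RTE_CRYPTO_CIPHER_AES128_CBC.
def is_valid_aes128_cbc_key(key: str) -> bool:
    return len(key.encode("ascii")) == 16

# The 16-character key used in this guide's bdev_crypto_create example:
print(is_valid_aes128_cbc_key("0123456789123456"))  # True
print(is_valid_aes128_cbc_key("tooshort"))          # False
```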

In order to support using the bdev block offset (LBA) as the initialization vector (IV),
the crypto module breaks up all I/O into crypto operations of a size equal to the block
size of the underlying bdev. For example, a 4K I/O to a bdev with a 512B block size
would result in 8 cryptographic operations.

For reads, the buffer provided to the crypto module will be used as the destination buffer
for unencrypted data. For writes, however, a temporary scratch buffer is used as the
destination buffer for encryption, which is then passed on to the underlying bdev as the
write buffer. This is done to avoid encrypting the data in the original source buffer, which
may cause problems in some use cases.

Example command

`rpc.py bdev_crypto_create NVMe1n1 CryNvmeA crypto_aesni_mb 0123456789123456`

This command will create a crypto vbdev called 'CryNvmeA' on top of the NVMe bdev
'NVMe1n1' and will use the DPDK software driver 'crypto_aesni_mb' and the key
'0123456789123456'.

To remove the vbdev use the `bdev_crypto_delete` command.

`rpc.py bdev_crypto_delete CryNvmeA`

# Delay Bdev Module {#bdev_config_delay}

The delay vbdev module is intended to apply a predetermined additional latency on top of a lower
level bdev. This enables the simulation of the latency characteristics of a device during the functional
or scalability testing of an SPDK application. For example, to simulate the effect of drive latency when
processing I/Os, one could configure a null bdev with a delay bdev on top of it.

The delay bdev module is not intended to provide a high-fidelity replication of a specific NVMe drive's latency;
instead, its main purpose is to provide a "big picture" understanding of how a generic latency affects a given
application.

A delay bdev is created using the `bdev_delay_create` RPC.
This RPC takes six arguments: one for the name
of the delay bdev and one for the name of the base bdev. The remaining four arguments represent the following
latency values: average read latency, average write latency, p99 read latency, and p99 write latency.
Within the context of the delay bdev, p99 latency means that one percent of the I/O will be delayed by at
least the value of the p99 latency before being completed to the upper level protocol. All of the latency values
are measured in microseconds.

Example command:

`rpc.py bdev_delay_create -b Null0 -d delay0 -r 10 --nine-nine-read-latency 50 -w 30 --nine-nine-write-latency 90`

This command will create a delay bdev with average read and write latencies of 10 and 30 microseconds and p99 read
and write latencies of 50 and 90 microseconds, respectively.

A delay bdev can be deleted using the `bdev_delay_delete` RPC.

Example command:

`rpc.py bdev_delay_delete delay0`

# GPT (GUID Partition Table) {#bdev_config_gpt}

The GPT virtual bdev driver is enabled by default and does not require any configuration.
It will automatically detect an @ref bdev_ug_gpt on any attached bdev and may create
multiple virtual bdevs.

## SPDK GPT partition table {#bdev_ug_gpt}

The SPDK partition type GUID is `7c5222bd-8f5d-4087-9c00-bf9843c7b58c`. Existing SPDK bdevs
can be exposed as Linux block devices via NBD and can then be partitioned with
standard partitioning tools. After partitioning, the bdevs will need to be deleted and
attached again for the GPT bdev module to see any changes. The NBD kernel module must be
loaded first. To create an NBD device, users should use the `start_nbd_disk` RPC command.

Example command

`rpc.py start_nbd_disk Malloc0 /dev/nbd0`

This will expose the SPDK bdev `Malloc0` under the `/dev/nbd0` block device.

To remove an NBD device, users should use the `stop_nbd_disk` RPC command.

Example command

`rpc.py stop_nbd_disk /dev/nbd0`

To display the full NBD device list, or a specified device, users should use the `get_nbd_disks` RPC command.

Example command

`rpc.py get_nbd_disks -n /dev/nbd0`

## Creating a GPT partition table using NBD {#bdev_ug_gpt_create_part}

~~~
# Expose bdev Nvme0n1 as kernel block device /dev/nbd0 by JSON-RPC
rpc.py start_nbd_disk Nvme0n1 /dev/nbd0

# Create GPT partition table.
parted -s /dev/nbd0 mklabel gpt

# Add a partition consuming 50% of the available space.
parted -s /dev/nbd0 mkpart MyPartition '0%' '50%'

# Change the partition type to the SPDK GUID.
# sgdisk is part of the gdisk package.
sgdisk -t 1:7c5222bd-8f5d-4087-9c00-bf9843c7b58c /dev/nbd0

# Stop the NBD device (stop exporting /dev/nbd0).
rpc.py stop_nbd_disk /dev/nbd0

# Now Nvme0n1 is configured with a GPT partition table, and
# the first partition will be automatically exposed as
# Nvme0n1p1 in SPDK applications.
~~~

# iSCSI bdev {#bdev_config_iscsi}

The SPDK iSCSI bdev driver depends on libiscsi and hence is not enabled by default.
In order to use it, build SPDK with the extra `--with-iscsi-initiator` configure option.

The following command creates an `iSCSI0` bdev from a single LUN exposed at the given iSCSI URL,
with `iqn.2016-06.io.spdk:init` as the reported initiator IQN.

`rpc.py bdev_iscsi_create -b iSCSI0 -i iqn.2016-06.io.spdk:init --url iscsi://127.0.0.1/iqn.2016-06.io.spdk:disk1/0`

The URL is in the following format:
`iscsi://[<username>[%<password>]@]<host>[:<port>]/<target-iqn>/<lun>`

# Linux AIO bdev {#bdev_config_aio}

The SPDK AIO bdev driver provides SPDK block layer access to Linux kernel block
devices, or a file on a Linux filesystem, via Linux AIO. Note that O_DIRECT is
used and thus bypasses the Linux page cache.
This mode is probably as close to
a typical kernel-based target as a user space target can get without using a
user-space driver. To create an AIO bdev, the RPC command `bdev_aio_create` should be
used.

Example commands

`rpc.py bdev_aio_create /dev/sda aio0`

This command will create an `aio0` device from /dev/sda.

`rpc.py bdev_aio_create /tmp/file file 8192`

This command will create a `file` device with block size 8192 from /tmp/file.

To delete an AIO bdev use the `bdev_aio_delete` command.

`rpc.py bdev_aio_delete aio0`

# OCF Virtual bdev {#bdev_config_cas}

The OCF virtual bdev module is based on the [Open CAS Framework](https://github.com/Open-CAS/ocf), a
high performance block storage caching meta-library.
To enable the module, configure SPDK using the `--with-ocf` flag.
An OCF bdev can be used to enable caching for any underlying bdev.

Below is an example command for creating an OCF bdev:

`rpc.py construct_ocf_bdev Cache1 wt Malloc0 Nvme0n1`

This command will create a new OCF bdev `Cache1` with bdev `Malloc0` as the caching device,
`Nvme0n1` as the core device, and an initial cache mode of `Write-Through`.
`Malloc0` will be used as a cache for `Nvme0n1`, so data written to `Cache1` will be present
on `Nvme0n1` eventually.
By default, OCF will be configured with a cache line size equal to 4KiB,
and non-volatile metadata will be disabled.

To remove `Cache1`:

`rpc.py delete_ocf_bdev Cache1`

During removal the OCF cache will be stopped and all cached data will be written to the core device.

Note that OCF has a per-device RAM requirement
of about 56000 + _cache device size_ * 58 / _cache line size_ (in bytes).
To get more information on OCF,
please visit the [OCF documentation](https://open-cas.github.io/).

# Malloc bdev {#bdev_config_malloc}

Malloc bdevs are ramdisks. Because of their nature, they are volatile.
They are created from hugepage memory given to the SPDK
application.

# Null {#bdev_config_null}

The SPDK null bdev driver is a dummy block I/O target that discards all writes and returns undefined
data for reads. It is useful for benchmarking the rest of the bdev I/O stack with minimal block
device overhead and for testing configurations that can't easily be created with the Malloc bdev.
To create a null bdev, the RPC command `construct_null_bdev` should be used.

Example command

`rpc.py construct_null_bdev Null0 8589934592 4096`

This command will create an 8 petabyte `Null0` device with block size 4096.

To delete a null bdev use the `delete_null_bdev` command.

`rpc.py delete_null_bdev Null0`

# NVMe bdev {#bdev_config_nvme}

There are two ways to create a block device based on an NVMe device in SPDK. The first
way is to connect a local PCIe drive, and the second is to connect to an NVMe-oF device.
In both cases, users should use the `construct_nvme_bdev` RPC command to achieve that.

Example commands

`rpc.py construct_nvme_bdev -b NVMe1 -t PCIe -a 0000:01:00.0`

This command will create an NVMe bdev from a physical device in the system.

`rpc.py construct_nvme_bdev -b Nvme0 -t RDMA -a 192.168.100.1 -f IPv4 -s 4420 -n nqn.2016-06.io.spdk:cnode1`

This command will create an NVMe bdev from an NVMe-oF resource.

To remove an NVMe controller use the `delete_nvme_controller` command.

`rpc.py delete_nvme_controller Nvme0`

This command will remove the NVMe controller named Nvme0.

# Logical volumes {#bdev_ug_logical_volumes}

The Logical Volumes library is a flexible storage space management system. It allows
creating and managing virtual block devices with variable size on top of other bdevs.
The SPDK Logical Volume library is built on top of @ref blob. For a detailed description,
please refer to @ref lvol.
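Space in an lvol store is tracked in clusters, so a store's capacity can be derived from the cluster fields reported by the `bdev_lvol_get_lvstores` RPC. A minimal illustrative sketch, using the values from the example response shown later in this chapter:

```python
# Illustrative sketch: compute lvol store capacity from the cluster fields
# reported by bdev_lvol_get_lvstores (values taken from this guide's
# example response; they are not live output).
lvs = {"free_clusters": 8190, "cluster_size": 8192, "total_data_clusters": 8190}

total_bytes = lvs["total_data_clusters"] * lvs["cluster_size"]
free_bytes = lvs["free_clusters"] * lvs["cluster_size"]
print(total_bytes)  # 67092480
print(free_bytes)   # 67092480 (nothing allocated yet)
```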

## Logical volume store {#bdev_ug_lvol_store}

Before creating any logical volumes (lvols), an lvol store has to be created on the
selected block device. The lvol store is the lvols' container, responsible for managing the
assignment of underlying bdev space to lvol bdevs and for storing metadata. To create an lvol store, users
should use the `construct_lvol_store` RPC command.

Example command

`rpc.py construct_lvol_store Malloc2 lvs -c 4096`

This will create an lvol store named `lvs` with cluster size 4096, built on top of the
`Malloc2` bdev. In response, the user will be provided with a UUID, which is the unique lvol store
identifier.

Users can get the list of available lvol stores using the `bdev_lvol_get_lvstores` RPC command (no
parameters available).

Example response

~~~
{
  "uuid": "330a6ab2-f468-11e7-983e-001e67edf35d",
  "base_bdev": "Malloc2",
  "free_clusters": 8190,
  "cluster_size": 8192,
  "total_data_clusters": 8190,
  "block_size": 4096,
  "name": "lvs"
}
~~~

To delete an lvol store, users should use the `destroy_lvol_store` RPC command.

Example commands

`rpc.py destroy_lvol_store -u 330a6ab2-f468-11e7-983e-001e67edf35d`

`rpc.py destroy_lvol_store -l lvs`

## Lvols {#bdev_ug_lvols}

To create lvols on an existing lvol store, users should use the `construct_lvol_bdev` RPC command.
Each created lvol will be represented by a new bdev.

Example commands

`rpc.py construct_lvol_bdev lvol1 25 -l lvs`

`rpc.py construct_lvol_bdev lvol2 25 -u 330a6ab2-f468-11e7-983e-001e67edf35d`

# RAID {#bdev_ug_raid}

The RAID virtual bdev module provides functionality to combine any SPDK bdevs into
one RAID bdev. Currently SPDK supports only RAID 0. The RAID functionality does not
store on-disk metadata on the member disks, so the user must reconstruct the RAID
volume when restarting the application.
Users may specify member disks to create a RAID
volume even if they do not exist yet - as the member disks are registered at
a later time, the RAID module will claim them and will surface the RAID volume
after all of the member disks are available. It is allowed to use disks of
different sizes - the smallest disk size will be the amount of space used on
each member disk.

Example commands

`rpc.py construct_raid_bdev -n Raid0 -z 64 -r 0 -b "lvol0 lvol1 lvol2 lvol3"`

`rpc.py get_raid_bdevs`

`rpc.py destroy_raid_bdev Raid0`

# Passthru {#bdev_config_passthru}

The SPDK Passthru virtual block device module serves as an example of how to write a
virtual block device module. It implements the required functionality of a vbdev module
and demonstrates some other basic features, such as the use of per-I/O context.

Example commands

`rpc.py construct_passthru_bdev -b aio -p pt`

`rpc.py delete_passthru_bdev pt`

# Pmem {#bdev_config_pmem}

The SPDK pmem bdev driver uses a pmemblk pool as the target for block I/O operations. For
details on Pmem memory please refer to the PMDK documentation on the http://pmem.io website.
First, users need to configure SPDK to include PMDK support:

`configure --with-pmdk`

To create a pmemblk pool for use with SPDK, users should use the `create_pmem_pool` RPC command.

Example command

`rpc.py create_pmem_pool /path/to/pmem_pool 25 4096`

To get information on a created pmem pool file, users can use the `pmem_pool_info` RPC command.

Example command

`rpc.py pmem_pool_info /path/to/pmem_pool`

To remove a pmem pool file, users can use the `delete_pmem_pool` RPC command.

Example command

`rpc.py delete_pmem_pool /path/to/pmem_pool`

To create a bdev based on a pmemblk pool file, users should use the `construct_pmem_bdev` RPC
command.

Example command

`rpc.py construct_pmem_bdev /path/to/pmem_pool -n pmem`

To remove a block device representation use the `delete_pmem_bdev` command.

`rpc.py delete_pmem_bdev pmem`

# Virtio Block {#bdev_config_virtio_blk}

The Virtio-Block driver allows creating SPDK bdevs from Virtio-Block devices.

The following command creates a Virtio-Block device named `VirtioBlk0` from a vhost-user
socket `/tmp/vhost.0` exposed directly by SPDK @ref vhost. The optional `vq-count` and
`vq-size` params specify the number of request queues and the queue depth to be used.

`rpc.py construct_virtio_dev --dev-type blk --trtype user --traddr /tmp/vhost.0 --vq-count 2 --vq-size 512 VirtioBlk0`

The driver can also be used inside QEMU-based VMs. The following command creates a Virtio
Block device named `VirtioBlk1` from a Virtio PCI device at address `0000:01:00.0`.
The entire configuration will be read automatically from PCI Configuration Space. It will
reflect all parameters passed to QEMU's vhost-user-blk-pci device.

`rpc.py construct_virtio_dev --dev-type blk --trtype pci --traddr 0000:01:00.0 VirtioBlk1`

Virtio-Block devices can be removed with the following command

`rpc.py remove_virtio_bdev VirtioBlk0`

# Virtio SCSI {#bdev_config_virtio_scsi}

The Virtio-SCSI driver allows creating SPDK block devices from Virtio-SCSI LUNs.

Virtio-SCSI bdevs are constructed the same way as Virtio-Block ones.

`rpc.py construct_virtio_dev --dev-type scsi --trtype user --traddr /tmp/vhost.0 --vq-count 2 --vq-size 512 VirtioScsi0`

`rpc.py construct_virtio_dev --dev-type scsi --trtype pci --traddr 0000:01:00.0 VirtioScsi0`

Each Virtio-SCSI device may export up to 64 block devices named VirtioScsi0t0 ~ VirtioScsi0t63,
one LUN (LUN0) per SCSI device. The above two commands will output the names of all exposed bdevs.
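The naming scheme described above can be sketched as follows (illustrative only; the authoritative names are those returned by the construction RPC):

```python
# Illustrative sketch of the Virtio-SCSI bdev naming scheme: a device named
# VirtioScsi0 may expose up to 64 targets, each with a single LUN (LUN0),
# surfaced as bdevs VirtioScsi0t0 .. VirtioScsi0t63.
def virtio_scsi_bdev_names(dev_name: str, num_targets: int = 64):
    return [f"{dev_name}t{target}" for target in range(num_targets)]

names = virtio_scsi_bdev_names("VirtioScsi0")
print(names[0], names[-1])  # VirtioScsi0t0 VirtioScsi0t63
```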

Virtio-SCSI devices can be removed with the following command

`rpc.py remove_virtio_bdev VirtioScsi0`

Removing a Virtio-SCSI device will destroy all its bdevs.