# Block Device User Guide {#bdev}

# Introduction {#bdev_ug_introduction}

The SPDK block device layer, often simply called *bdev*, is a C library
intended to be equivalent to the operating system block storage layer that
often sits immediately above the device drivers in a traditional kernel
storage stack. Specifically, this library provides the following
functionality:

* A pluggable module API for implementing block devices that interface with different types of block storage devices.
* Driver modules for NVMe, malloc (ramdisk), Linux AIO, virtio-scsi, Ceph RBD, Pmem, Vhost-SCSI Initiator, and more.
* An application API for enumerating and claiming SPDK block devices and then performing operations (read, write, unmap, etc.) on those devices.
* Facilities to stack block devices to create complex I/O pipelines, including logical volume management (lvol) and partition support (GPT).
* Configuration of block devices via JSON-RPC.
* Request queueing, timeout, and reset handling.
* Multiple, lockless queues for sending I/O to block devices.

The bdev module creates an abstraction layer that provides a common API for all devices.
Users can use the available bdev modules or create their own module with any type of
device underneath (please refer to @ref bdev_module for details). SPDK also
provides vbdev modules, which create block devices on top of existing bdevs, for
example @ref bdev_ug_logical_volumes or @ref bdev_ug_gpt.

# Prerequisites {#bdev_ug_prerequisites}

This guide assumes that you can already build the standard SPDK distribution
on your platform. The block device layer is a C library with a single public
header file named bdev.h. All SPDK configuration described in the following
chapters is done using JSON-RPC commands. SPDK provides a Python-based
command line tool for sending RPC commands, located at `scripts/rpc.py`. Users
can list the available commands by running this script with the `-h` or `--help` flag.
Additionally, users can retrieve the currently supported set of RPC commands
directly from an SPDK application by running `scripts/rpc.py rpc_get_methods`.
Detailed help for each command can be displayed by adding the `-h` flag as a
command parameter.

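As a rough illustration of what a tool such as `scripts/rpc.py` has to produce, the
sketch below builds a JSON-RPC 2.0 request for the `rpc_get_methods` command mentioned
above. The envelope fields follow the JSON-RPC 2.0 specification. The helper name
`make_request` is hypothetical, and the transport that carries the request to the SPDK
application is deliberately left out.

```python
import json
from itertools import count

# JSON-RPC uses the "id" field to match a reply to its request.
_request_ids = count(1)


def make_request(method, params=None):
    """Return a JSON-RPC 2.0 request string for the given method name."""
    request = {"jsonrpc": "2.0", "method": method, "id": next(_request_ids)}
    if params:
        # "params" is optional and omitted when the method takes no arguments.
        request["params"] = params
    return json.dumps(request)


print(make_request("rpc_get_methods"))
```
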
# General Purpose RPCs {#bdev_ug_general_rpcs}

## get_bdevs {#bdev_ug_get_bdevs}

The `get_bdevs` RPC command returns a list of the currently available block devices,
including detailed information about each of them. The optional `name` parameter
restricts the output to the bdev with that name.

Example response

~~~
{
  "num_blocks": 32768,
  "assigned_rate_limits": {
    "rw_ios_per_sec": 10000,
    "rw_mbytes_per_sec": 20
  },
  "supported_io_types": {
    "reset": true,
    "nvme_admin": false,
    "unmap": true,
    "read": true,
    "write_zeroes": true,
    "write": true,
    "flush": true,
    "nvme_io": false
  },
  "driver_specific": {},
  "claimed": false,
  "block_size": 4096,
  "product_name": "Malloc disk",
  "name": "Malloc0"
}
~~~

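The fields of such a response can be combined directly. The following sketch, which
assumes `block_size` is expressed in bytes, derives the capacity of the bdev from
`num_blocks` and `block_size`. The helper name `capacity_bytes` is made up for this
example, and the response is reduced to the fields that are used.

```python
import json

# Subset of the example response shown above.
response = json.loads(
    '{"name": "Malloc0", "num_blocks": 32768, "block_size": 4096}'
)


def capacity_bytes(bdev):
    """Capacity is the number of blocks times the size of one block."""
    return bdev["num_blocks"] * bdev["block_size"]


size = capacity_bytes(response)
print(f'{response["name"]}: {size} bytes ({size // (1024 * 1024)} MiB)')
# prints "Malloc0: 134217728 bytes (128 MiB)"
```
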
## set_bdev_qos_limit {#set_bdev_qos_limit}

Users can use the `set_bdev_qos_limit` RPC command to enable, adjust, and disable
rate limits on an existing bdev. Two types of rate limits are supported:
IOPS and bandwidth. The rate limits can be enabled, adjusted, and disabled at any
time for the specified bdev. The bdev name is a required parameter for this
RPC command and at least one of `rw_ios_per_sec` and `rw_mbytes_per_sec` must be
specified. When both rate limits are enabled, whichever limit is reached first
takes effect. The value 0 may be specified to disable the corresponding rate
limit. Users can run this command with `-h` or `--help` for more information.

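To see which of the two limits is reached first for a given workload, the bandwidth
limit can be converted into an equivalent I/O rate. The sketch below is only
back-of-the-envelope arithmetic, not the QoS implementation. It assumes that one
"mbyte" is 1024 * 1024 bytes and that all I/Os have the same size; the function name
`effective_iops` is hypothetical.

```python
MIB = 1024 * 1024  # assumption: rw_mbytes_per_sec counts units of 2**20 bytes


def effective_iops(io_size_bytes, rw_ios_per_sec=0, rw_mbytes_per_sec=0):
    """Return the I/O rate ceiling implied by both limits, or None if both are 0."""
    caps = []
    if rw_ios_per_sec > 0:  # 0 means the IOPS limit is disabled
        caps.append(rw_ios_per_sec)
    if rw_mbytes_per_sec > 0:  # 0 means the bandwidth limit is disabled
        caps.append(rw_mbytes_per_sec * MIB // io_size_bytes)
    return min(caps) if caps else None


# Limits taken from the get_bdevs example response: 10000 IOPS and 20 MB/s.
print(effective_iops(4096, 10000, 20))  # 5120: the bandwidth limit is hit first
print(effective_iops(512, 10000, 20))   # 10000: the IOPS limit is hit first
```
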
## Histograms {#rpc_bdev_histogram}

The `enable_bdev_histogram` RPC command enables or disables the gathering of
latency data for a specified bdev. The histogram can be downloaded by
calling `get_bdev_histogram` and parsed using the `scripts/histogram.py` script.

Example command

`rpc.py enable_bdev_histogram Nvme0n1 --enable`

The command will enable gathering data for a histogram on the Nvme0n1 device.

`rpc.py get_bdev_histogram Nvme0n1 | histogram.py`

The command will download the gathered histogram data. The script will parse
the data and show a table containing the I/O count for each latency range.

`rpc.py enable_bdev_histogram Nvme0n1 --disable`

The command will disable the histogram on the Nvme0n1 device.

# Ceph RBD {#bdev_config_rbd}

The SPDK RBD bdev driver provides SPDK block layer access to Ceph RADOS block
devices (RBD). The RADOS block devices exported by Ceph are accessed via the
librbd and librados libraries. To create a Ceph bdev, use the
`construct_rbd_bdev` RPC command.

Example command

`rpc.py construct_rbd_bdev rbd foo 512`

This command will create a bdev that represents the 'foo' image from a pool called 'rbd'.

To remove a block device representation use the `delete_rbd_bdev` command.

`rpc.py delete_rbd_bdev Rbd0`

# Compression Virtual Bdev Module {#bdev_config_compress}

The compression bdev module can be configured to provide compression/decompression
services for an underlying thinly provisioned logical volume. Although the underlying
module can be anything (e.g. an NVMe bdev), the overall compression benefits will not be realized
unless the data stored on disk is placed appropriately. The compression vbdev module
relies on an internal SPDK library called `reduce` to accomplish this, see @ref reduce
for detailed information.

The vbdev module relies on the DPDK CompressDev Framework to provide all compression
functionality. The framework provides support for many different software-only
compression modules as well as hardware-assisted support for Intel QAT. At this
time the vbdev module supports the DPDK drivers for ISAL and QAT.

Persistent memory is used to store metadata associated with the layout of the data on the
backing device. SPDK relies on [PMDK](http://pmem.io/pmdk/) to interface with persistent memory, so any hardware
supported by PMDK should work. If the directory for PMEM supplied upon vbdev creation does
not point to persistent memory (e.g. it points to a regular filesystem), performance will be severely
impacted. The vbdev module and reduce libraries were designed to use persistent memory for
any production use.

Example command

`rpc.py bdev_compress_create -p /pmem_files -b myLvol`

In this example, a compression vbdev is created using persistent memory that is mapped to
the directory `pmem_files` on top of the existing thinly provisioned logical volume `myLvol`.
The resulting compression bdev will be named `COMP_LVS/myLvol` where LVS is the name of the
logical volume store that `myLvol` resides on.

The logical volume is referred to as the backing device and once the compression vbdev is
created it cannot be separated from the persistent memory file that will be created in
the specified directory. If the persistent memory file is not available, the compression
vbdev will also not be available.

By default the vbdev module will choose the QAT driver if the hardware and drivers are
available and loaded. If not, it will revert to the software-only ISAL driver. The driver
can be selected with the following command; however, the setting is not persistent, so it
must be applied either upon creation or before the underlying logical volume is loaded in
order to be honored. A value of `0` tells the vbdev module to use QAT if available and
ISAL otherwise; this is the default, so the command is not required if that behavior is
sufficient. A value of `1` tells the module to use QAT; if QAT is not available, the vbdev
should fail to create or load. A value of `2`, as shown below, tells the module
to use ISAL; if for some reason it is not available, the vbdev should fail to create or load.

`rpc.py set_compress_pmd -p 2`

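The meaning of the three values can be summarized as a small decision function. This
is only a sketch of the behavior described above, not the module's code; the function
name `select_pmd` and the two availability flags are hypothetical.

```python
def select_pmd(option, qat_available, isal_available):
    """Return the driver implied by option 0, 1 or 2, or raise if none qualifies."""
    if option not in (0, 1, 2):
        raise ValueError(f"unknown option: {option}")
    # 0 = QAT preferred with ISAL fallback, 1 = QAT only.
    if option in (0, 1) and qat_available:
        return "QAT"
    # 0 = fallback path, 2 = ISAL only.
    if option in (0, 2) and isal_available:
        return "ISAL"
    # Corresponds to the vbdev failing to create or load.
    raise RuntimeError("requested compression driver is not available")


print(select_pmd(0, qat_available=False, isal_available=True))  # ISAL
print(select_pmd(2, qat_available=True, isal_available=True))   # ISAL
```
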
To remove a compression vbdev, use the following command, which will also delete the PMEM
file. If the logical volume is deleted, the PMEM file will not be removed and the
compression vbdev will not be available.

`rpc.py bdev_compress_delete COMP_LVS/myLvol`

To list compression volumes that are only available for deletion because their PMEM file
was missing, use the following command. The name parameter is optional: if it is omitted,
all such volumes are listed; if it is given, the command returns the name or an error that
the device does not exist.

`rpc.py bdev_compress_get_orphans --name COMP_Nvme0n1`

# Crypto Virtual Bdev Module {#bdev_config_crypto}

The crypto virtual bdev module can be configured to provide at rest data encryption
for any underlying bdev. The module relies on the DPDK CryptoDev Framework to provide
all cryptographic functionality. The framework provides support for many different software-only
cryptographic modules as well as hardware-assisted support for the Intel QAT board. The
framework also provides support for cipher, hash, authentication and AEAD functions. At this
time the SPDK virtual bdev module supports cipher only as follows:

- AES-NI Multi Buffer Crypto Poll Mode Driver: RTE_CRYPTO_CIPHER_AES128_CBC
- Intel(R) QuickAssist (QAT) Crypto Poll Mode Driver: RTE_CRYPTO_CIPHER_AES128_CBC
(Note: QAT is functional but is marked as experimental until the hardware has
been fully integrated with the SPDK CI system.)

In order to support using the bdev block offset (LBA) as the initialization vector (IV),
the crypto module breaks up all I/O into crypto operations of a size equal to the block
size of the underlying bdev. For example, a 4K I/O to a bdev with a 512B block size
would result in 8 cryptographic operations.

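The splitting rule can be illustrated with a few lines of arithmetic. The sketch below
only enumerates the per-block operations and the LBA each one would use as its IV; it
performs no encryption, and the function name `crypto_ops` and the starting LBA of 100
are made up for the example.

```python
def crypto_ops(offset_blocks, io_size_bytes, block_size):
    """Return one (lba, byte offset in the I/O buffer) pair per crypto operation."""
    if io_size_bytes % block_size:
        raise ValueError("I/O size must be a multiple of the block size")
    num_ops = io_size_bytes // block_size
    # Each operation covers exactly one block, so each gets its own LBA as IV.
    return [(offset_blocks + i, i * block_size) for i in range(num_ops)]


ops = crypto_ops(offset_blocks=100, io_size_bytes=4096, block_size=512)
print(len(ops))         # 8
print(ops[0], ops[-1])  # (100, 0) (107, 3584)
```
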
For reads, the buffer provided to the crypto module will be used as the destination buffer
for unencrypted data. For writes, however, a temporary scratch buffer is used as the
destination buffer for encryption which is then passed on to the underlying bdev as the
write buffer. This is done to avoid encrypting the data in the original source buffer which
may cause problems in some use cases.

Example command

`rpc.py bdev_crypto_create NVMe1n1 CryNvmeA crypto_aesni_mb 0123456789123456`

This command will create a crypto vbdev called 'CryNvmeA' on top of the NVMe bdev
'NVMe1n1' and will use the DPDK software driver 'crypto_aesni_mb' and the key
'0123456789123456'.

To remove the vbdev use the `bdev_crypto_delete` command.

`rpc.py bdev_crypto_delete CryNvmeA`

# Delay Bdev Module {#bdev_config_delay}

The delay vbdev module is intended to apply a predetermined additional latency on top of a lower
level bdev. This enables the simulation of the latency characteristics of a device during the functional
or scalability testing of an SPDK application. For example, to simulate the effect of drive latency when
processing I/Os, one could configure a NULL bdev with a delay bdev on top of it.

The delay bdev module is not intended to provide a high fidelity replication of a specific NVMe drive's latency;
instead its main purpose is to provide a "big picture" understanding of how a generic latency affects a given
application.

A delay bdev is created using the `bdev_delay_create` RPC. This RPC takes 6 arguments: one for the name
of the delay bdev, one for the name of the base bdev, and four that represent the following
latency values: average read latency, average write latency, p99 read latency, and p99 write latency.
Within the context of the delay bdev, p99 latency means that one percent of the I/O will be delayed by at
least the value of the p99 latency before being completed to the upper level protocol. All of the latency values
are measured in microseconds.

Example command:

`rpc.py bdev_delay_create -b Null0 -d delay0 -r 10 --nine-nine-read-latency 50 -w 30 --nine-nine-write-latency 90`

This command will create a delay bdev with average read and write latencies of 10 and 30 microseconds and p99 read
and write latencies of 50 and 90 microseconds respectively.

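To make the p99 definition concrete, the sketch below assigns a delay to each of 1000
reads using the values from the example command. The deterministic "every 100th I/O"
policy is an assumption chosen for illustration; the module may pick the delayed one
percent differently. The function name `delay_for` is hypothetical.

```python
def delay_for(io_index, avg_us, p99_us):
    """Give every 100th I/O the p99 latency and all others the average latency."""
    return p99_us if io_index % 100 == 99 else avg_us


# Read latencies from the example command: average 10 us, p99 50 us.
read_delays = [delay_for(i, avg_us=10, p99_us=50) for i in range(1000)]
print(read_delays.count(50), read_delays.count(10))  # 10 990
```
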
A delay bdev can be deleted using the `bdev_delay_delete` RPC.

Example command:

`rpc.py bdev_delay_delete delay0`

# GPT (GUID Partition Table) {#bdev_config_gpt}

The GPT virtual bdev driver is enabled by default and does not require any configuration.
It will automatically detect @ref bdev_ug_gpt on any attached bdev and will create
virtual bdevs, possibly several.

## SPDK GPT partition table {#bdev_ug_gpt}

The SPDK partition type GUID is `7c5222bd-8f5d-4087-9c00-bf9843c7b58c`. Existing SPDK bdevs
can be exposed as Linux block devices via NBD and then can be partitioned with
standard partitioning tools. After partitioning, the bdevs will need to be deleted and
attached again for the GPT bdev module to see any changes. The NBD kernel module must be
loaded first. To expose a bdev as an NBD device, use the `start_nbd_disk` RPC command.

Example command

`rpc.py start_nbd_disk Malloc0 /dev/nbd0`

This will expose an SPDK bdev `Malloc0` under the `/dev/nbd0` block device.

To remove an NBD device, use the `stop_nbd_disk` RPC command.

Example command

`rpc.py stop_nbd_disk /dev/nbd0`

To display the full list of NBD devices, or a specified NBD device, use the `get_nbd_disks` RPC command.

Example command

`rpc.py get_nbd_disks -n /dev/nbd0`

## Creating a GPT partition table using NBD {#bdev_ug_gpt_create_part}

~~~
# Expose bdev Nvme0n1 as kernel block device /dev/nbd0 by JSON-RPC
rpc.py start_nbd_disk Nvme0n1 /dev/nbd0

# Create GPT partition table.
parted -s /dev/nbd0 mklabel gpt

# Add a partition consuming 50% of the available space.
parted -s /dev/nbd0 mkpart MyPartition '0%' '50%'

# Change the partition type to the SPDK GUID.
# sgdisk is part of the gdisk package.
sgdisk -t 1:7c5222bd-8f5d-4087-9c00-bf9843c7b58c /dev/nbd0

# Stop the NBD device (stop exporting /dev/nbd0).
rpc.py stop_nbd_disk /dev/nbd0

# Now Nvme0n1 is configured with a GPT partition table, and
# the first partition will be automatically exposed as
# Nvme0n1p1 in SPDK applications.
~~~

# iSCSI bdev {#bdev_config_iscsi}

The SPDK iSCSI bdev driver depends on libiscsi and hence is not enabled by default.
In order to use it, build SPDK with the extra `--with-iscsi-initiator` configure option.

The following command creates an `iSCSI0` bdev from a single LUN exposed at the given iSCSI URL
with `iqn.2016-06.io.spdk:init` as the reported initiator IQN.

`rpc.py bdev_iscsi_create -b iSCSI0 -i iqn.2016-06.io.spdk:init --url iscsi://127.0.0.1/iqn.2016-06.io.spdk:disk1/0`

The URL is in the following format:
`iscsi://[<username>[%<password>]@]<host>[:<port>]/<target-iqn>/<lun>`

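The URL format can be checked before it is passed to the RPC. The following sketch
splits a URL into the components named in the format string above. It is a simplified
reading of that format, not libiscsi's parser: it assumes a hostname or IPv4 address as
host and a numeric LUN, and the names `ISCSI_URL` and `parse_iscsi_url` are hypothetical.

```python
import re

# iscsi://[<username>[%<password>]@]<host>[:<port>]/<target-iqn>/<lun>
ISCSI_URL = re.compile(
    r"^iscsi://"
    r"(?:(?P<username>[^%@/]+)(?:%(?P<password>[^@/]+))?@)?"  # optional credentials
    r"(?P<host>[^:/]+)(?::(?P<port>\d+))?"                    # host and optional port
    r"/(?P<target_iqn>[^/]+)/(?P<lun>\d+)$"                   # target IQN and LUN
)


def parse_iscsi_url(url):
    """Return the URL components as a dict; absent optional parts are None."""
    match = ISCSI_URL.match(url)
    if match is None:
        raise ValueError(f"not a valid iSCSI URL: {url}")
    return match.groupdict()


parts = parse_iscsi_url("iscsi://127.0.0.1/iqn.2016-06.io.spdk:disk1/0")
print(parts["host"], parts["target_iqn"], parts["lun"])
```
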
# Linux AIO bdev {#bdev_config_aio}

The SPDK AIO bdev driver provides SPDK block layer access to Linux kernel block
devices or a file on a Linux filesystem via Linux AIO. Note that O_DIRECT is
used and thus bypasses the Linux page cache. This mode is probably as close to
a typical kernel based target as a user space target can get without using a
user-space driver. To create an AIO bdev, use the `bdev_aio_create` RPC
command.

Example commands

`rpc.py bdev_aio_create /dev/sda aio0`

This command will create the `aio0` device from /dev/sda.

`rpc.py bdev_aio_create /tmp/file file 8192`

This command will create the `file` device with block size 8192 from /tmp/file.

To delete an aio bdev use the `bdev_aio_delete` command.

`rpc.py bdev_aio_delete aio0`

# OCF Virtual bdev {#bdev_config_cas}

The OCF virtual bdev module is based on the [Open CAS Framework](https://github.com/Open-CAS/ocf), a
high performance block storage caching meta-library.
To enable the module, configure SPDK using the `--with-ocf` flag.
An OCF bdev can be used to enable caching for any underlying bdev.

Below is an example command for creating an OCF bdev:

`rpc.py construct_ocf_bdev Cache1 wt Malloc0 Nvme0n1`

This command will create a new OCF bdev `Cache1` with bdev `Malloc0` as the caching device,
`Nvme0n1` as the core device, and `Write-Through` as the initial cache mode.
`Malloc0` will be used as a cache for `Nvme0n1`, so data written to `Cache1` will eventually
be present on `Nvme0n1`.
By default, OCF will be configured with a cache line size of 4KiB
and non-volatile metadata will be disabled.

To remove `Cache1`:

`rpc.py delete_ocf_bdev Cache1`

During removal the OCF cache will be stopped and all cached data will be written to the core device.

Note that OCF has a per-device RAM requirement
of about 56000 + _cache device size_ * 58 / _cache line size_ (in bytes).
For more information on OCF,
please visit the [OCF documentation](https://open-cas.github.io/).

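The RAM estimate above is easy to evaluate for a concrete cache device. The sketch
below applies the formula as written, with the default cache line size of 4KiB; the
function name `ocf_ram_bytes` and the 1 GiB cache size are chosen for illustration only.

```python
def ocf_ram_bytes(cache_device_size, cache_line_size=4096):
    """Approximate per-device RAM requirement in bytes, per the formula above."""
    return 56000 + cache_device_size * 58 // cache_line_size


one_gib = 1024 ** 3
print(ocf_ram_bytes(one_gib))         # 15260352
print(ocf_ram_bytes(one_gib, 65536))  # 1006272
```
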
# Malloc bdev {#bdev_config_malloc}

Malloc bdevs are ramdisks and are therefore volatile. They are created from the hugepage memory given to the SPDK
application.

# Null {#bdev_config_null}

The SPDK null bdev driver is a dummy block I/O target that discards all writes and returns undefined
data for reads. It is useful for benchmarking the rest of the bdev I/O stack with minimal block
device overhead and for testing configurations that can't easily be created with the Malloc bdev.
To create a Null bdev, use the `construct_null_bdev` RPC command.

Example command

`rpc.py construct_null_bdev Null0 8589934592 4096`

This command will create an 8 petabyte `Null0` device with block size 4096.

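The size quoted above can be verified with a short calculation. This sketch assumes
that the second argument of the example command is the total size in MiB, which is
the reading that makes the "8 petabyte" figure add up (in binary units, 8 PiB).

```python
MIB = 1024 ** 2
PIB = 1024 ** 5

total_size_mib = 8589934592  # second argument of the example command (assumed MiB)
block_size = 4096            # third argument of the example command

total_bytes = total_size_mib * MIB
print(total_bytes // PIB)         # 8
print(total_bytes // block_size)  # 2199023255552 blocks
```
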
To delete a null bdev use the `delete_null_bdev` command.

`rpc.py delete_null_bdev Null0`

# NVMe bdev {#bdev_config_nvme}

There are two ways to create a block device based on an NVMe device in SPDK. The first
is to attach a local PCIe drive and the second is to connect to an NVMe-oF device.
In both cases, use the `construct_nvme_bdev` RPC command.

Example commands

`rpc.py construct_nvme_bdev -b NVMe1 -t PCIe -a 0000:01:00.0`

This command will create an NVMe bdev for a physical device in the system.

`rpc.py construct_nvme_bdev -b Nvme0 -t RDMA -a 192.168.100.1 -f IPv4 -s 4420 -n nqn.2016-06.io.spdk:cnode1`

This command will create an NVMe bdev for an NVMe-oF resource.

To remove an NVMe controller use the `delete_nvme_controller` command.

`rpc.py delete_nvme_controller Nvme0`

This command will remove the NVMe controller named Nvme0.

# Logical volumes {#bdev_ug_logical_volumes}

The Logical Volumes library is a flexible storage space management system. It allows
creating and managing virtual block devices with variable size on top of other bdevs.
The SPDK Logical Volume library is built on top of @ref blob. For a detailed description
please refer to @ref lvol.

## Logical volume store {#bdev_ug_lvol_store}

Before creating any logical volumes (lvols), an lvol store has to be created on the
selected block device. An lvol store is a container for lvols, responsible for managing the
assignment of underlying bdev space to lvol bdevs and for storing metadata. To create an
lvol store, use the `construct_lvol_store` RPC command.

Example command

`rpc.py construct_lvol_store Malloc2 lvs -c 4096`

This will create an lvol store named `lvs` with a cluster size of 4096, built on top of
the `Malloc2` bdev. The response contains a UUID, which is the unique identifier of the
lvol store.

Users can get a list of the available lvol stores using the `bdev_lvol_get_lvstores` RPC command (no
parameters available).

Example response

~~~
{
  "uuid": "330a6ab2-f468-11e7-983e-001e67edf35d",
  "base_bdev": "Malloc2",
  "free_clusters": 8190,
  "cluster_size": 8192,
  "total_data_clusters": 8190,
  "block_size": 4096,
  "name": "lvs"
}
~~~

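Because an lvol store accounts for space in clusters, the free capacity follows from
the cluster counters in the response. The sketch below evaluates the example response,
assuming `cluster_size` is expressed in bytes; the variable names are made up for the
example.

```python
import json

# The example response shown above.
lvs = json.loads("""
{
  "uuid": "330a6ab2-f468-11e7-983e-001e67edf35d",
  "base_bdev": "Malloc2",
  "free_clusters": 8190,
  "cluster_size": 8192,
  "total_data_clusters": 8190,
  "block_size": 4096,
  "name": "lvs"
}
""")

free_bytes = lvs["free_clusters"] * lvs["cluster_size"]
used_clusters = lvs["total_data_clusters"] - lvs["free_clusters"]
print(free_bytes)     # 67092480
print(used_clusters)  # 0, no lvol has allocated a cluster yet
```
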
To delete an lvol store, use the `destroy_lvol_store` RPC command.

Example commands

`rpc.py destroy_lvol_store -u 330a6ab2-f468-11e7-983e-001e67edf35d`

`rpc.py destroy_lvol_store -l lvs`

## Lvols {#bdev_ug_lvols}

To create lvols on an existing lvol store, use the `construct_lvol_bdev` RPC command.
Each created lvol will be represented by a new bdev.

Example commands

`rpc.py construct_lvol_bdev lvol1 25 -l lvs`

`rpc.py construct_lvol_bdev lvol2 25 -u 330a6ab2-f468-11e7-983e-001e67edf35d`

# RAID {#bdev_ug_raid}

The RAID virtual bdev module provides functionality to combine any SPDK bdevs into
one RAID bdev. Currently SPDK supports only RAID 0. The RAID functionality does not
store on-disk metadata on the member disks, so the user must reconstruct the RAID
volume when restarting the application. The user may specify member disks for a RAID
volume even if they do not exist yet: as the member disks are registered at
a later time, the RAID module will claim them and will surface the RAID volume
after all of the member disks are available. Disks of different sizes may be
used; the smallest disk size will be the amount of space used on
each member disk.

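The capacity rule and the striping idea behind RAID 0 can be sketched in a few lines.
This is generic RAID 0 arithmetic, not the SPDK RAID module's code. The function names
are hypothetical, and the 64 KiB strip size used in the example is an assumption.

```python
def raid0_capacity(member_sizes):
    """Usable capacity: every member contributes as much as the smallest one."""
    return min(member_sizes) * len(member_sizes)


def raid0_locate(offset, strip_size, num_members):
    """Map a byte offset in the volume to (member index, byte offset on member)."""
    strip_index, within_strip = divmod(offset, strip_size)
    # Strips are distributed round-robin across the members.
    stripe_index, member = divmod(strip_index, num_members)
    return member, stripe_index * strip_size + within_strip


print(raid0_capacity([100, 120, 100, 150]))      # 400
print(raid0_locate(5 * 65536 + 100, 65536, 4))   # (1, 65636)
```
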
Example commands

`rpc.py construct_raid_bdev -n Raid0 -z 64 -r 0 -b "lvol0 lvol1 lvol2 lvol3"`

`rpc.py get_raid_bdevs`

`rpc.py destroy_raid_bdev Raid0`

# Passthru {#bdev_config_passthru}

The SPDK Passthru virtual block device module serves as an example of how to write a
virtual block device module. It implements the required functionality of a vbdev module
and demonstrates some other basic features such as the use of per I/O context.

Example commands

`rpc.py construct_passthru_bdev -b aio -p pt`

`rpc.py delete_passthru_bdev pt`

# Pmem {#bdev_config_pmem}

The SPDK pmem bdev driver uses a pmemblk pool as the target for block I/O operations. For
details on pmem memory please refer to the PMDK documentation on the http://pmem.io website.
First, the user needs to configure SPDK to include PMDK support:

`configure --with-pmdk`

To create a pmemblk pool for use with SPDK, use the `create_pmem_pool` RPC command.

Example command

`rpc.py create_pmem_pool /path/to/pmem_pool 25 4096`

To get information on a created pmem pool file, use the `pmem_pool_info` RPC command.

Example command

`rpc.py pmem_pool_info /path/to/pmem_pool`

To remove a pmem pool file, use the `delete_pmem_pool` RPC command.

Example command

`rpc.py delete_pmem_pool /path/to/pmem_pool`

To create a bdev based on a pmemblk pool file, use the `construct_pmem_bdev` RPC
command.

Example command

`rpc.py construct_pmem_bdev /path/to/pmem_pool -n pmem`

To remove a block device representation use the `delete_pmem_bdev` command.

`rpc.py delete_pmem_bdev pmem`

# Virtio Block {#bdev_config_virtio_blk}

The Virtio-Block driver allows creating SPDK bdevs from Virtio-Block devices.

The following command creates a Virtio-Block device named `VirtioBlk0` from a vhost-user
socket `/tmp/vhost.0` exposed directly by SPDK @ref vhost. The optional `vq-count` and
`vq-size` params specify the number of request queues and the queue depth to be used.

`rpc.py construct_virtio_dev --dev-type blk --trtype user --traddr /tmp/vhost.0 --vq-count 2 --vq-size 512 VirtioBlk0`

The driver can also be used inside QEMU-based VMs. The following command creates a Virtio
Block device named `VirtioBlk1` from a Virtio PCI device at address `0000:01:00.0`.
The entire configuration will be read automatically from PCI Configuration Space. It will
reflect all parameters passed to QEMU's vhost-user-scsi-pci device.

`rpc.py construct_virtio_dev --dev-type blk --trtype pci --traddr 0000:01:00.0 VirtioBlk1`

Virtio-Block devices can be removed with the following command:

`rpc.py remove_virtio_bdev VirtioBlk0`

# Virtio SCSI {#bdev_config_virtio_scsi}

The Virtio-SCSI driver allows creating SPDK block devices from Virtio-SCSI LUNs.

Virtio-SCSI bdevs are constructed the same way as Virtio-Block ones.

`rpc.py construct_virtio_dev --dev-type scsi --trtype user --traddr /tmp/vhost.0 --vq-count 2 --vq-size 512 VirtioScsi0`

`rpc.py construct_virtio_dev --dev-type scsi --trtype pci --traddr 0000:01:00.0 VirtioScsi0`

Each Virtio-SCSI device may export up to 64 block devices named VirtioScsi0t0 ~ VirtioScsi0t63,
one LUN (LUN0) per SCSI device. The above 2 commands will output names of all exposed bdevs.

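The naming scheme can be reproduced with a short helper, for example to know in
advance which bdev names a script may have to look for. The helper name
`virtio_scsi_bdev_names` is hypothetical; it only lists the possible names and says
nothing about which targets are actually exposed.

```python
def virtio_scsi_bdev_names(device_name, num_targets=64):
    """List the possible bdev names: <device>t<target> for targets 0..63."""
    return [f"{device_name}t{target}" for target in range(num_targets)]


names = virtio_scsi_bdev_names("VirtioScsi0")
print(names[0], names[-1], len(names))  # VirtioScsi0t0 VirtioScsi0t63 64
```
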
Virtio-SCSI devices can be removed with the following command:

`rpc.py remove_virtio_bdev VirtioScsi0`

Removing a Virtio-SCSI device will destroy all its bdevs.
