# Block Device Layer Programming Guide {#bdev_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK bdev library to access block devices.

## Introduction

A block device is a storage device that supports reading and writing data in
fixed-size blocks. These blocks are usually 512 or 4096 bytes. The
devices may be logical constructs in software or correspond to physical
devices like NVMe SSDs.

The block device layer consists of a single generic library in `lib/bdev`,
plus a number of optional modules (as separate libraries) that implement
various types of block devices. The public header file for the generic library
is bdev.h, which is the entirety of the API needed to interact with any type
of block device. This guide will cover how to interact with bdevs using that
API. For a guide to implementing a bdev module, see @ref bdev_module.

The bdev layer provides a number of useful features in addition to providing a
common abstraction for all block devices:

- Automatic queueing of I/O requests in response to queue full or out-of-memory conditions
- Hot remove support, even while I/O traffic is occurring
- I/O statistics such as bandwidth and latency
- Device reset support and I/O timeout tracking

## Basic Primitives

Users of the bdev API interact with a number of basic objects.

struct spdk_bdev, which this guide will refer to as a *bdev*, represents a
generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*,
represents a handle to a given block device. Descriptors are used to establish
and track permissions to use the underlying block device, much like a file
descriptor on UNIX systems. Requests to the block device are asynchronous and
represented by spdk_bdev_io objects. Requests must be submitted on an
associated I/O channel. The motivation and design of I/O channels are described
in @ref concurrency.

Bdevs can be layered, such that some bdevs service I/O by routing requests to
other bdevs. This can be used to implement caching, RAID, logical volume
management, and more. Bdevs that route I/O to other bdevs are often referred
to as virtual bdevs, or *vbdevs* for short.

## Initializing The Library

The bdev layer depends on the generic message passing infrastructure
abstracted by the header file include/spdk/thread.h. See @ref concurrency for a
full description. Most importantly, calls into the bdev library may only be
made from threads that have been allocated with SPDK by calling
spdk_allocate_thread().

From an allocated thread, the bdev library may be initialized by calling
spdk_bdev_initialize(), which is an asynchronous operation. Until the completion
callback is called, no other bdev library functions may be invoked. Similarly,
to tear down the bdev library, call spdk_bdev_finish().

## Discovering Block Devices

All block devices have a simple string name. At any time, a pointer to the
device object can be obtained by calling spdk_bdev_get_by_name(), or the entire
set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and
their variants.

Some block devices may also be given aliases, which are also string names.
Aliases behave like symlinks - they can be used interchangeably with the real
name to look up the block device.

## Preparing To Use A Block Device

In order to send I/O requests to a block device, it must first be opened by
calling spdk_bdev_open(). This will return a descriptor. Multiple users may have
a bdev open at the same time, and coordination of reads and writes between
users must be handled by some higher level mechanism outside of the bdev
layer. Opening a bdev with write permission may fail if a virtual bdev module
has *claimed* the bdev.
Virtual bdev modules implement logic like RAID or
logical volume management and forward their I/O to lower level bdevs, so they
mark these lower level bdevs as claimed to prevent outside users from issuing
writes.

When a block device is opened, an optional callback and context can be
provided that will be called if the underlying storage servicing the block
device is removed. For example, the remove callback will be called on each
open descriptor for a bdev backed by a physical NVMe SSD when the NVMe SSD is
hot-unplugged. The callback can be thought of as a request to close the open
descriptor so other memory may be freed. A bdev cannot be torn down while open
descriptors exist, so it is highly recommended that a callback is provided.

When a user is done with a descriptor, they may release it by calling
spdk_bdev_close().

Descriptors may be passed to and used from multiple threads simultaneously.
However, for each thread a separate I/O channel must be obtained by calling
spdk_bdev_get_io_channel(). This will allocate the necessary per-thread
resources to submit I/O requests to the bdev without taking locks. To release
a channel, call spdk_put_io_channel(). A descriptor cannot be closed until
all associated channels have been destroyed.

## Sending I/O

Once a descriptor and a channel have been obtained, I/O may be sent by calling
the various I/O submission functions such as spdk_bdev_read(). These calls each
take a callback as an argument which will be called some time later with a
handle to an spdk_bdev_io object. In response to that completion, the user
must call spdk_bdev_free_io() to release the resources. Within this callback,
the user may also use the functions spdk_bdev_io_get_nvme_status() and
spdk_bdev_io_get_scsi_status() to obtain error information in the format of
their choosing.
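
The completion contract above can be illustrated with a small, self-contained
model. This is only a sketch: it does not call SPDK, and every name in it
(`mock_dev`, `mock_submit_read()`, `mock_poll()`, `mock_free_io()`,
`read_done()`) is hypothetical. The fixed-size request pool and the `-ENOMEM`
return value are assumptions of the mock, chosen to show that a request
rejected at submission time never reaches its callback, while an accepted
request always does and must then be released by the callback.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; none of these are SPDK symbols. */
struct mock_io;
typedef void (*mock_io_cb)(struct mock_io *io, bool success, void *cb_arg);

struct mock_io {
    bool in_use;      /* true from submission until the callback frees it */
    mock_io_cb cb;    /* completion callback supplied by the caller */
    void *cb_arg;     /* caller context handed back to the callback */
};

#define MOCK_POOL_SIZE 2

struct mock_dev {
    struct mock_io pool[MOCK_POOL_SIZE];
};

/* Returns 0 if the request was accepted, or -ENOMEM if the pool is empty.
 * On a nonzero return the callback is never invoked. */
int mock_submit_read(struct mock_dev *dev, mock_io_cb cb, void *cb_arg)
{
    for (size_t i = 0; i < MOCK_POOL_SIZE; i++) {
        if (!dev->pool[i].in_use) {
            dev->pool[i].in_use = true;
            dev->pool[i].cb = cb;
            dev->pool[i].cb_arg = cb_arg;
            return 0;
        }
    }
    return -ENOMEM;
}

/* Release a completed request back to the pool. */
void mock_free_io(struct mock_io *io)
{
    assert(io->in_use);
    io->in_use = false;
}

/* Complete every accepted request, reporting device_ok as the result.
 * Returns the number of callbacks that were invoked. */
int mock_poll(struct mock_dev *dev, bool device_ok)
{
    int completed = 0;

    for (size_t i = 0; i < MOCK_POOL_SIZE; i++) {
        struct mock_io *io = &dev->pool[i];
        if (io->in_use) {
            io->cb(io, device_ok, io->cb_arg);
            completed++;
        }
    }
    return completed;
}

/* Number of requests that have been accepted but not yet freed. */
int mock_in_use(const struct mock_dev *dev)
{
    int n = 0;

    for (size_t i = 0; i < MOCK_POOL_SIZE; i++) {
        n += dev->pool[i].in_use ? 1 : 0;
    }
    return n;
}

struct read_stats {
    int ok;
    int failed;
};

/* Completion callback: record the outcome, then release the request. */
void read_done(struct mock_io *io, bool success, void *cb_arg)
{
    struct read_stats *stats = cb_arg;

    if (success) {
        stats->ok++;
    } else {
        stats->failed++;
    }
    mock_free_io(io);
}
```

In this model, forgetting the `mock_free_io()` call in `read_done()` would
leave the pool permanently exhausted after two reads, which is the kind of
resource leak the release step is meant to prevent.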

I/O submission is performed by calling functions such as spdk_bdev_read() or
spdk_bdev_write(). These functions take as an argument a pointer to a region of
memory or a scatter gather list describing memory that will be transferred to
the block device. This memory must be allocated through spdk_dma_malloc() or
its variants. For a full explanation of why the memory must come from a
special allocation pool, see @ref memory. Where possible, data in memory will
be *directly transferred to the block device* using
[Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access).
That means it is not copied.

All I/O submission functions are asynchronous and non-blocking. They will not
block or stall the thread for any reason. However, the I/O submission
functions may fail in one of two ways. First, they may fail immediately and
return an error code. In that case, the provided callback will not be called.
Second, they may fail asynchronously. In that case, the associated
spdk_bdev_io will be passed to the callback and it will report error
information.

Some I/O request types are optional and may not be supported by a given bdev.
To query a bdev for the I/O request types it supports, call
spdk_bdev_io_type_supported().

## Resetting A Block Device

In order to handle unexpected failure conditions, the bdev library provides a
mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass
a message to every other thread for which an I/O channel exists for the bdev,
pause that channel, then forward a reset request to the underlying bdev module
and wait for completion. Upon completion, the I/O channels will resume and the
reset will complete. The behavior inside the bdev module is
module-specific. For example, NVMe devices will delete all queue pairs,
perform an NVMe reset, then recreate the queue pairs and continue.
Most
importantly, regardless of device type, *all I/O outstanding to the block
device will be completed prior to the reset completing.*
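
The ordering guarantee of a reset can be sketched with a single-threaded model.
This is an illustration under stated assumptions, not SPDK code: all names
(`mock_bdev`, `mock_reset()`, `reset_done()`, the `EV_*` events) are
hypothetical. The real operation is asynchronous and spans threads, whereas
this model runs the steps inline, and it assumes the module-specific reset is
the step that completes the outstanding I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model for illustration only; none of these are SPDK symbols. */
#define MAX_EVENTS 16

enum mock_event {
    EV_CHANNEL_PAUSED,
    EV_IO_COMPLETED,
    EV_CHANNEL_RESUMED,
    EV_RESET_DONE
};

struct mock_bdev {
    int num_channels;                 /* one I/O channel per thread */
    int outstanding_io;               /* requests submitted but not completed */
    enum mock_event log[MAX_EVENTS];  /* order in which the steps happened */
    size_t log_len;
};

typedef void (*mock_reset_cb)(struct mock_bdev *bdev, bool success, void *cb_arg);

/* Append one event to the log so the ordering can be inspected afterwards. */
void log_event(struct mock_bdev *bdev, enum mock_event ev)
{
    assert(bdev->log_len < MAX_EVENTS);
    bdev->log[bdev->log_len++] = ev;
}

/* Model of the reset sequence described above. */
void mock_reset(struct mock_bdev *bdev, mock_reset_cb cb, void *cb_arg)
{
    /* Step 1: pause every I/O channel that exists for the bdev. */
    for (int i = 0; i < bdev->num_channels; i++) {
        log_event(bdev, EV_CHANNEL_PAUSED);
    }

    /* Step 2: module-specific reset. In this model it completes
     * every outstanding request. */
    while (bdev->outstanding_io > 0) {
        bdev->outstanding_io--;
        log_event(bdev, EV_IO_COMPLETED);
    }

    /* Step 3: resume the channels. */
    for (int i = 0; i < bdev->num_channels; i++) {
        log_event(bdev, EV_CHANNEL_RESUMED);
    }

    /* Step 4: only now does the reset itself complete. */
    log_event(bdev, EV_RESET_DONE);
    cb(bdev, true, cb_arg);
}

/* Reset completion callback: store the result in the caller's flag. */
void reset_done(struct mock_bdev *bdev, bool success, void *cb_arg)
{
    (void)bdev;
    *(bool *)cb_arg = success;
}
```

Inspecting the log after a call to `mock_reset()` shows every
`EV_IO_COMPLETED` entry ahead of `EV_RESET_DONE`, which is the property a
caller can rely on when deciding what to do after a reset.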