# Block Device Layer Programming Guide {#bdev_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK bdev library to access block devices.

## Introduction

A block device is a storage device that supports reading and writing data in
fixed-size blocks. These blocks are usually 512 or 4096 bytes. The
devices may be logical constructs in software or correspond to physical
devices like NVMe SSDs.

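To make the fixed-size block model concrete, the sketch below converts a byte
offset and length into block units and rejects ranges that are not
block-aligned. The `block_range` struct and `bytes_to_blocks()` helper are
hypothetical names chosen for illustration; they are not part of the SPDK API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: a byte range expressed in whole blocks. */
struct block_range {
    uint64_t offset_blocks;
    uint64_t num_blocks;
};

/*
 * Convert a byte offset and length into block units.
 * Returns false if block_size is zero or if either value is not a
 * multiple of block_size; *out is left untouched in that case.
 */
static bool
bytes_to_blocks(uint64_t offset, uint64_t nbytes, uint32_t block_size,
                struct block_range *out)
{
    if (block_size == 0 || offset % block_size != 0 || nbytes % block_size != 0) {
        return false;
    }
    out->offset_blocks = offset / block_size;
    out->num_blocks = nbytes / block_size;
    return true;
}
```

With 512-byte blocks, a 4096-byte request at byte offset 8192 covers 8 blocks
starting at block 16. With 4096-byte blocks, the same request is a single
block starting at block 2.
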
The block device layer consists of a single generic library in `lib/bdev`,
plus a number of optional modules (as separate libraries) that implement
various types of block devices. The public header file for the generic library
is bdev.h, which contains the entire API needed to interact with any type
of block device. This guide will cover how to interact with bdevs using that
API. For a guide to implementing a bdev module, see @ref bdev_module.

The bdev layer provides a number of useful features in addition to providing a
common abstraction for all block devices:

- Automatic queueing of I/O requests in response to queue-full or out-of-memory conditions
- Hot remove support, even while I/O traffic is occurring
- I/O statistics such as bandwidth and latency
- Device reset support and I/O timeout tracking

## Basic Primitives

Users of the bdev API interact with a number of basic objects.

struct spdk_bdev, which this guide will refer to as a *bdev*, represents a
generic block device. struct spdk_bdev_desc, hereafter called a *descriptor*,
represents a handle to a given block device. Descriptors are used to establish
and track permissions to use the underlying block device, much like a file
descriptor on UNIX systems. Requests to the block device are asynchronous and
represented by spdk_bdev_io objects. Requests must be submitted on an
associated I/O channel. The motivation and design of I/O channels are described
in @ref concurrency.

Bdevs can be layered, such that some bdevs service I/O by routing requests to
other bdevs. This can be used to implement caching, RAID, logical volume
management, and more. Bdevs that route I/O to other bdevs are often referred
to as virtual bdevs, or *vbdevs* for short.

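As a rough illustration of layering, the sketch below models a chain of
devices in which each virtual device maps its blocks onto a base device at a
fixed offset, similar in spirit to a partition. The `layered_dev` struct and
`layered_route()` function are a toy model invented for this example; they do
not reflect how struct spdk_bdev or any real vbdev module is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a bdev; NOT the real struct spdk_bdev. */
struct layered_dev {
    const char *name;
    struct layered_dev *base;   /* NULL for a bottom-level device */
    uint64_t offset_blocks;     /* where this device starts on its base */
};

/*
 * Follow the chain of base devices, accumulating offsets, and return the
 * bottom-level device. *block is rewritten to the block on that device.
 */
static struct layered_dev *
layered_route(struct layered_dev *dev, uint64_t *block)
{
    while (dev->base != NULL) {
        *block += dev->offset_blocks;
        dev = dev->base;
    }
    return dev;
}
```

For a device stacked at offset 100 on top of another device that itself sits
at offset 2048 on a bottom-level device, block 5 of the top device resolves to
block 2153 of the bottom-level device.
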
## Initializing The Library

The bdev layer depends on the generic message passing infrastructure
abstracted by the header file include/spdk/thread.h. See @ref concurrency for a
full description. Most importantly, calls into the bdev library may only be
made from threads that have been allocated with SPDK by calling
spdk_thread_create().

From an allocated thread, the bdev library may be initialized by calling
spdk_bdev_initialize(), which is an asynchronous operation. Until the completion
callback is called, no other bdev library functions may be invoked. Similarly,
to tear down the bdev library, call spdk_bdev_finish().

## Discovering Block Devices

All block devices have a simple string name. At any time, a pointer to the
device object can be obtained by calling spdk_bdev_get_by_name(), or the entire
set of bdevs may be iterated using spdk_bdev_first() and spdk_bdev_next() and
their variants, or spdk_for_each_bdev() and its variant.

Some block devices may also be given aliases, which are also string names.
Aliases behave like symlinks: they can be used interchangeably with the real
name to look up the block device.

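The symlink-like behavior of aliases can be pictured with a small lookup
routine that treats a device's name and its aliases as equivalent keys. This
is a self-contained toy model: `named_dev`, `NAMED_MAX_ALIASES`, and
`named_lookup()` are hypothetical, and the sketch says nothing about how
spdk_bdev_get_by_name() stores or searches names internally.

```c
#include <stddef.h>
#include <string.h>

#define NAMED_MAX_ALIASES 4

/* Toy stand-in for a registered bdev; NOT the real struct spdk_bdev. */
struct named_dev {
    const char *name;
    const char *aliases[NAMED_MAX_ALIASES];   /* unused slots are NULL */
};

/* Return the first device whose name or any alias equals `key`, else NULL. */
static const struct named_dev *
named_lookup(const struct named_dev *devs, size_t count, const char *key)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(devs[i].name, key) == 0) {
            return &devs[i];
        }
        for (size_t j = 0; j < NAMED_MAX_ALIASES && devs[i].aliases[j] != NULL; j++) {
            if (strcmp(devs[i].aliases[j], key) == 0) {
                return &devs[i];
            }
        }
    }
    return NULL;
}
```

Looking up either the real name or one of the aliases yields the same device
pointer, while an unknown string yields NULL.
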
## Preparing To Use A Block Device

In order to send I/O requests to a block device, it must first be opened by
calling spdk_bdev_open_ext(). This will return a descriptor. Multiple users may have
a bdev open at the same time, and coordination of reads and writes between
users must be handled by some higher level mechanism outside of the bdev
layer. Opening a bdev with write permission may fail if a virtual bdev module
has *claimed* the bdev. Virtual bdev modules implement logic like RAID or
logical volume management and forward their I/O to lower level bdevs, so they
mark these lower level bdevs as claimed to prevent outside users from issuing
writes.

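The interaction between claims and write opens can be sketched with a toy
model. Everything here (`claimable_dev`, `claimable_claim()`,
`claimable_open()`) is hypothetical, and returning `-EPERM` for a refused
request is an assumption of this sketch rather than a documented guarantee.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a bdev; NOT the real struct spdk_bdev. */
struct claimable_dev {
    const char *claimed_by;   /* name of the claiming module, or NULL */
    int open_count;
};

/* Record a claim on behalf of a module; only one claim may exist at a time. */
static int
claimable_claim(struct claimable_dev *dev, const char *module)
{
    if (dev->claimed_by != NULL) {
        return -EPERM;   /* error code is this sketch's choice */
    }
    dev->claimed_by = module;
    return 0;
}

/*
 * Open the device. Read-only opens always succeed in this model; a write
 * open is refused while a module holds a claim.
 */
static int
claimable_open(struct claimable_dev *dev, bool write)
{
    if (write && dev->claimed_by != NULL) {
        return -EPERM;   /* error code is this sketch's choice */
    }
    dev->open_count++;
    return 0;
}
```

Once a module has claimed the device in this model, outside users can still
open it read-only, but a write open fails until the claim is gone.
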
When a block device is opened, a callback and context must be provided. The
callback will be called with the appropriate spdk_bdev_event_type value as an
argument when the bdev triggers an asynchronous event such as bdev removal.
For example, the callback will be called on each open descriptor for a bdev
backed by a physical NVMe SSD when the NVMe SSD is hot-unplugged. In this case
the callback can be thought of as a request to close the open descriptor so
other memory may be freed. A bdev cannot be torn down while open descriptors
exist, so it is required that a callback is provided.

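The removal flow described above can be modeled in a few lines: every open
descriptor is notified, and teardown finishes only after the last descriptor
is closed. This is a self-contained toy; all `hot_*` names and the two example
handlers are hypothetical and are not the SPDK types or functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define HOT_MAX_DESCS 4

enum hot_event { HOT_EVENT_REMOVE };

struct hot_dev;
struct hot_desc;
typedef void (*hot_event_cb)(enum hot_event type, struct hot_desc *desc, void *ctx);

/* Toy descriptor: one open handle plus the event callback given at open time. */
struct hot_desc {
    struct hot_dev *dev;
    hot_event_cb event_cb;
    void *event_ctx;
    bool open;
};

/* Toy device; NOT the real struct spdk_bdev. */
struct hot_dev {
    struct hot_desc descs[HOT_MAX_DESCS];
    bool removing;
    bool destroyed;
};

/* Finish teardown only once removal was requested and no descriptor is open. */
static void
hot_try_destroy(struct hot_dev *dev)
{
    if (!dev->removing) {
        return;
    }
    for (size_t i = 0; i < HOT_MAX_DESCS; i++) {
        if (dev->descs[i].open) {
            return;
        }
    }
    dev->destroyed = true;
}

/* Open a descriptor; a callback is mandatory, as the text above requires. */
static struct hot_desc *
hot_open(struct hot_dev *dev, hot_event_cb cb, void *ctx)
{
    if (dev->removing || cb == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < HOT_MAX_DESCS; i++) {
        if (!dev->descs[i].open) {
            dev->descs[i] = (struct hot_desc){ dev, cb, ctx, true };
            return &dev->descs[i];
        }
    }
    return NULL;
}

static void
hot_close(struct hot_desc *desc)
{
    desc->open = false;
    hot_try_destroy(desc->dev);
}

/* Notify every open descriptor, then destroy if nothing is left open. */
static void
hot_remove(struct hot_dev *dev)
{
    dev->removing = true;
    for (size_t i = 0; i < HOT_MAX_DESCS; i++) {
        if (dev->descs[i].open) {
            dev->descs[i].event_cb(HOT_EVENT_REMOVE, &dev->descs[i],
                                   dev->descs[i].event_ctx);
        }
    }
    hot_try_destroy(dev);
}

/* Example handler: count the event and close the descriptor immediately. */
static void
close_on_remove(enum hot_event type, struct hot_desc *desc, void *ctx)
{
    (void)type;
    (*(int *)ctx)++;
    hot_close(desc);
}

/* Example handler: only count the event; the owner closes later. */
static void
note_remove(enum hot_event type, struct hot_desc *desc, void *ctx)
{
    (void)type;
    (void)desc;
    (*(int *)ctx)++;
}
```

A user that closes inside the handler releases its descriptor right away; a
user that only records the event keeps the toy device alive until it
eventually calls `hot_close()`.
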
When a user is done with a descriptor, they may release it by calling
spdk_bdev_close().

Descriptors may be passed to and used from multiple threads simultaneously.
However, for each thread a separate I/O channel must be obtained by calling
spdk_bdev_get_io_channel(). This will allocate the necessary per-thread
resources to submit I/O requests to the bdev without taking locks. To release
a channel, call spdk_put_io_channel(). A descriptor cannot be closed until
all associated channels have been destroyed.

## Sending I/O

Once a descriptor and a channel have been obtained, I/O may be sent by calling
the various I/O submission functions such as spdk_bdev_read(). These calls each
take a callback as an argument which will be called some time later with a
handle to an spdk_bdev_io object. In response to that completion, the user
must call spdk_bdev_free_io() to release the resources. Within this callback,
the user may also use the functions spdk_bdev_io_get_nvme_status() and
spdk_bdev_io_get_scsi_status() to obtain error information in the format of
their choosing.

I/O submission is performed by calling functions such as spdk_bdev_read() or
spdk_bdev_write(). These functions take as an argument a pointer to a region of
memory or a scatter gather list describing memory that will be transferred to
the block device. This memory must be allocated through spdk_dma_malloc() or
its variants. For a full explanation of why the memory must come from a
special allocation pool, see @ref memory. Where possible, data in memory will
be *directly transferred to the block device* using
[Direct Memory Access](https://en.wikipedia.org/wiki/Direct_memory_access).
That means it is not copied.

All I/O submission functions are asynchronous and non-blocking. They will not
block or stall the thread for any reason. However, the I/O submission
functions may fail in one of two ways. First, they may fail immediately and
return an error code. In that case, the provided callback will not be called.
Second, they may fail asynchronously. In that case, the associated
spdk_bdev_io will be passed to the callback and it will report error
information.

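The two failure paths call for two different checks in the caller, which the
following self-contained model tries to make visible. The `toy_*` names, the
fixed queue depth, and the use of `-ENOMEM` for an immediate rejection are all
assumptions of this sketch; it imitates the shape of the contract described
above, not the real submission functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 2

typedef void (*toy_io_cb)(bool success, void *cb_arg);

struct toy_io {
    toy_io_cb cb;
    void *cb_arg;
    bool will_succeed;   /* outcome the toy device will report later */
};

/* Toy stand-in for an I/O channel with a bounded set of in-flight requests. */
struct toy_channel {
    struct toy_io inflight[TOY_QUEUE_DEPTH];
    size_t count;
};

/*
 * Failure path 1: reject the request immediately with an error code.
 * The callback is never called for a request that was not accepted.
 */
static int
toy_submit(struct toy_channel *ch, bool will_succeed, toy_io_cb cb, void *cb_arg)
{
    if (ch->count == TOY_QUEUE_DEPTH) {
        return -ENOMEM;   /* error code is this sketch's choice */
    }
    ch->inflight[ch->count++] = (struct toy_io){ cb, cb_arg, will_succeed };
    return 0;
}

/*
 * Failure path 2: the request was accepted, so its outcome (good or bad)
 * is reported later through the callback. Returns the number completed.
 */
static size_t
toy_poll(struct toy_channel *ch)
{
    struct toy_channel batch = *ch;   /* snapshot so callbacks may resubmit */

    ch->count = 0;
    for (size_t i = 0; i < batch.count; i++) {
        batch.inflight[i].cb(batch.inflight[i].will_succeed, batch.inflight[i].cb_arg);
    }
    return batch.count;
}

/* Example callback: tally outcomes into the struct passed as cb_arg. */
struct toy_tally {
    int ok;
    int failed;
};

static void
toy_tally_cb(bool success, void *cb_arg)
{
    struct toy_tally *tally = cb_arg;

    if (success) {
        tally->ok++;
    } else {
        tally->failed++;
    }
}
```

A caller therefore checks the return value of every submission and,
separately, checks the status delivered to the callback. A request rejected
at submission time never reaches the callback, so any cleanup for it has to
happen at the call site.
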
Some I/O request types are optional and may not be supported by a given bdev.
To query a bdev for the I/O request types it supports, call
spdk_bdev_io_type_supported().

## Resetting A Block Device

In order to handle unexpected failure conditions, the bdev library provides a
mechanism to perform a device reset by calling spdk_bdev_reset(). This will pass
a message to every other thread for which an I/O channel exists for the bdev,
pause that channel, then forward a reset request to the underlying bdev module
and wait for completion. Upon completion, the I/O channels will resume and the
reset will complete. The specific behavior inside the bdev module is
module-specific. For example, NVMe devices will delete all queue pairs,
perform an NVMe reset, then recreate the queue pairs and continue. Most
importantly, regardless of device type, *all I/O outstanding to the block
device will be completed prior to the reset completing.*