# Writing a Custom Block Device Module {#bdev_module}

## Target Audience

This programming guide is intended for developers authoring their own block
device modules to integrate with SPDK's bdev layer. For a guide on how to use
the bdev layer, see @ref bdev_pg.

## Introduction

A block device module is SPDK's equivalent of a device driver in a traditional
operating system. The module provides a set of function pointers that are
called to service block device I/O requests. SPDK provides a number of block
device modules including NVMe, RAM-disk, and Ceph RBD. However, some users
will want to write their own to interact with either custom hardware or an
existing storage software stack. This guide is intended to demonstrate exactly
how to write a module.

## Creating A New Module

Block device modules are located in subdirectories under module/bdev today. It is not
currently possible to place the code for a bdev module elsewhere, but updates
to the build system could be made to enable this in the future. To create a
module, add a new directory with a single C file and a Makefile. A great
starting point is to copy the existing 'null' bdev module.

The primary interface that bdev modules will interact with is in
include/spdk/bdev_module.h. In that header a macro is defined that registers
a new bdev module - SPDK_BDEV_MODULE_REGISTER.
This macro takes as its argument a
pointer to a spdk_bdev_module structure that is used to register the new bdev module.

The spdk_bdev_module structure describes the module properties like
initialization (`module_init`) and teardown (`module_fini`) functions,
the function that returns context size (`get_ctx_size`) - scratch space that
will be allocated in each I/O request for use by this module, and callbacks
that will be called each time a new bdev is registered by another module
(`examine_config` and `examine_disk`). Please check the documentation of
struct spdk_bdev_module for more details.

## Creating Bdevs

New bdevs are created within the module by calling spdk_bdev_register(). The
module must allocate a struct spdk_bdev, fill it out appropriately, and pass
it to the register call. The most important field to fill out is `fn_table`,
which points at this data structure:

~~~{.c}
/*
 * Function table for a block device backend.
 *
 * The backend block device function table provides a set of APIs to allow
 * communication with a backend. The main commands are read/write API
 * calls for I/O via submit_request.
 */
struct spdk_bdev_fn_table {
	/* Destroy the backend block device object */
	int (*destruct)(void *ctx);

	/* Process the IO. */
	void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);

	/* Check if the block device supports a specific I/O type. */
	bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);

	/* Get an I/O channel for the specific bdev for the calling thread. */
	struct spdk_io_channel *(*get_io_channel)(void *ctx);

	/*
	 * Output driver-specific configuration to a JSON stream. Optional - may be NULL.
	 *
	 * The JSON write context will be initialized with an open object, so the bdev
	 * driver should write a name (based on the driver name) followed by a JSON value
	 * (most likely another nested object).
	 */
	int (*dump_config_json)(void *ctx, struct spdk_json_write_ctx *w);

	/* Get spin-time per I/O channel in microseconds.
	 *  Optional - may be NULL.
	 */
	uint64_t (*get_spin_time)(struct spdk_io_channel *ch);
};
~~~

The bdev module must implement these function callbacks, except for those
marked optional.

The `destruct` function is called to tear down the device when the system no
longer needs it. What `destruct` does is up to the module - it may just be
freeing memory or it may be shutting down a piece of hardware.

The `io_type_supported` function returns whether a particular I/O type is
supported.
The available I/O types are:

~~~{.c}
/** bdev I/O type */
enum spdk_bdev_io_type {
	SPDK_BDEV_IO_TYPE_INVALID = 0,
	SPDK_BDEV_IO_TYPE_READ,
	SPDK_BDEV_IO_TYPE_WRITE,
	SPDK_BDEV_IO_TYPE_UNMAP,
	SPDK_BDEV_IO_TYPE_FLUSH,
	SPDK_BDEV_IO_TYPE_RESET,
	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
	SPDK_BDEV_IO_TYPE_NVME_IO,
	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};
~~~

For the simplest bdev modules, only `SPDK_BDEV_IO_TYPE_READ` and
`SPDK_BDEV_IO_TYPE_WRITE` are necessary. `SPDK_BDEV_IO_TYPE_UNMAP` is often
referred to as "trim" or "deallocate", and is a request to mark a set of
blocks as no longer containing valid data. `SPDK_BDEV_IO_TYPE_FLUSH` is a
request to make all previously completed writes durable. Many devices do not
require flushes. `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is just like a regular
write, but does not provide a data buffer (it would have just contained all
0's). If it isn't supported, the generic bdev code is capable of emulating it
by sending regular write requests.

`SPDK_BDEV_IO_TYPE_RESET` is a request to abort all I/O and return the
underlying device to its initial state. Do not complete the reset request
until all I/O has been completed in some way.

`SPDK_BDEV_IO_TYPE_NVME_ADMIN`, `SPDK_BDEV_IO_TYPE_NVME_IO`, and
`SPDK_BDEV_IO_TYPE_NVME_IO_MD` are all mechanisms for passing raw NVMe
commands through the SPDK bdev layer. They're strictly optional, and it
probably only makes sense to implement those if the backing storage device is
capable of handling NVMe commands.

The `get_io_channel` function should return an I/O channel. For a detailed
explanation of I/O channels, see @ref concurrency. The generic bdev layer will
call `get_io_channel` one time per thread, cache the result, and pass that
result to `submit_request`. It will use the corresponding channel for the
thread it calls `submit_request` on.

The `submit_request` function is called to actually submit I/O requests to the
block device. Once the I/O request is completed, the module must call
spdk_bdev_io_complete(). The I/O does not have to finish within the calling
context of `submit_request`.

Integrating a new bdev module into the build system requires updates to various
files in the /mk directory.

## Creating Bdevs in an External Repository

A user can build their own bdev module and application on top of existing SPDK libraries. The example in
test/external_code serves as a template for creating, building, and linking an external
bdev module. Refer to test/external_code/README.md and @ref so_linking for further information.

## Creating Virtual Bdevs

Block devices are considered virtual if they handle I/O requests by routing
the I/O to other block devices. The canonical example would be a bdev module
that implements RAID. Virtual bdevs are created in the same way as regular
bdevs, but take the one additional step of claiming the bdev.

The module can open the underlying bdevs it wishes to route I/O to using
spdk_bdev_open_ext(), where the string name is provided by the user via an RPC.
To ensure that other consumers do not modify the underlying bdev in an unexpected
way, the virtual bdev should take a claim on the underlying bdev before
reading from or writing to the underlying bdev.

There are two slightly different APIs for taking and releasing claims. The
preferred interface uses `spdk_bdev_module_claim_bdev_desc()`. This method allows
claims that ensure there is a single writer with
`SPDK_BDEV_CLAIM_READ_MANY_WRITE_ONE`, cooperating shared writers with
`SPDK_BDEV_CLAIM_READ_MANY_WRITE_SHARED`, and shared readers that prevent any
writers with `SPDK_BDEV_CLAIM_READ_MANY_WRITE_NONE`. In all cases,
`spdk_bdev_open_ext()` may be used to open the underlying bdev read-only. If a
read-only bdev descriptor successfully claims a bdev with
`SPDK_BDEV_CLAIM_READ_MANY_WRITE_ONE` or `SPDK_BDEV_CLAIM_READ_MANY_WRITE_SHARED`
the bdev descriptor is promoted to read-write.
Any claim that is obtained with `spdk_bdev_module_claim_bdev_desc()` is
automatically released upon closing the bdev descriptor used to obtain the
claim. Shared claims continue to block new incompatible claims and new writers
until the last claim is released.

The non-preferred interface for obtaining a claim allows the caller to obtain
an exclusive writer claim with `spdk_bdev_module_claim_bdev()`. It may
be released with `spdk_bdev_module_release_bdev()`. If a read-only bdev
descriptor is passed, it is promoted to read-write. NULL may be passed instead
of a bdev descriptor to avoid promotion and to block new writers. New code
should use `spdk_bdev_module_claim_bdev_desc()` with the claim type that is
tailored to the virtual bdev's needs.

The descriptor obtained from the successful spdk_bdev_open_ext() may be used
with spdk_bdev_get_io_channel() to obtain I/O channels for the bdev. This is
likely done in response to the virtual bdev's `get_io_channel` callback.
Channels may be obtained before and/or after claiming the underlying bdev, but
beware there may be other unknown writers until the underlying bdev has been
claimed.

When a virtual bdev module claims an underlying bdev from its `examine_config`
callback, it causes the `examine_disk` callback to only be called for this
module and any others that establish a shared claim.
If no claims are taken by
`examine_config` callbacks, all virtual bdevs' `examine_disk` callbacks are
called.
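
The examine rule in the last paragraph can be restated as a small toy model.
The sketch below is not SPDK code and does not use SPDK types. `struct
toy_module` and `toy_dispatch_examine_disk` are hypothetical names introduced
only to show which modules would receive `examine_disk` for a single bdev,
assuming each module has already reported whether it took a claim during
`examine_config`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a virtual bdev module; not an SPDK type. */
struct toy_module {
	const char *name;
	bool claimed_in_examine_config; /* took a claim during examine_config */
	bool examine_disk_called;       /* filled in by toy_dispatch_examine_disk() */
};

/*
 * Mark which modules would have examine_disk called for one bdev:
 * if any module claimed the bdev, only the claiming modules are selected;
 * otherwise every module is selected. Returns the number selected.
 */
static size_t
toy_dispatch_examine_disk(struct toy_module *mods, size_t count)
{
	bool any_claim = false;
	size_t called = 0;
	size_t i;

	assert(mods != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (mods[i].claimed_in_examine_config) {
			any_claim = true;
		}
	}

	for (i = 0; i < count; i++) {
		mods[i].examine_disk_called =
			!any_claim || mods[i].claimed_in_examine_config;
		if (mods[i].examine_disk_called) {
			called++;
		}
	}

	return called;
}
```

In this model, two modules that both set `claimed_in_examine_config`
correspond to modules holding a shared claim, and a module that did not claim
is skipped as soon as any other module has.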