# Blobstore Programmer's Guide {#blob}

## In this document {#blob_pg_toc}

* @ref blob_pg_audience
* @ref blob_pg_intro
* @ref blob_pg_theory
* @ref blob_pg_design
* @ref blob_pg_examples
* @ref blob_pg_config
* @ref blob_pg_component

## Target Audience {#blob_pg_audience}

The programmer's guide is intended for developers authoring applications that utilize the SPDK Blobstore. It is
intended to supplement the source code by providing an overall understanding of how to integrate Blobstore into
an application as well as provide some high level insight into how Blobstore works behind the scenes. It is not
intended to serve as a design document or an API reference, although in some cases source code snippets and high
level sequences will be discussed; for the latest source code, refer to the [repo](https://github.com/spdk).

## Introduction {#blob_pg_intro}

Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system
backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services
can be local databases or key/value stores (MySQL, RocksDB), dedicated appliances (SAN, NAS), or
distributed storage systems (e.g. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however,
and it is intentionally not POSIX compliant. To avoid confusion, we avoid references to files or objects and instead
use the term 'blob'. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to
groups of blocks, called 'blobs', on a block device. Blobs are typically large, measured in at least hundreds of
kilobytes, and are always a multiple of the underlying block size.

The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random
reads and writes, with no required background garbage collection. However, in practice the design will run well on
NAND too.

## Theory of Operation {#blob_pg_theory}

### Abstractions

The Blobstore defines a hierarchy of storage abstractions as follows.

* **Logical Block**: Logical blocks are exposed by the disk itself and are numbered from 0 to N-1, where N is the
  number of blocks in the disk. A logical block is typically either 512B or 4KiB.
* **Page**: A page is defined to be a fixed number of logical blocks defined at Blobstore creation time. The logical
  blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk such
  that the first page worth of blocks is page 0, the second page is page 1, etc. A page is typically 4KiB in size,
  so in practice this is either 8 or 1 logical blocks. The SSD must be able to perform atomic reads and writes of
  at least the page size.
* **Cluster**: A cluster is a fixed number of pages defined at Blobstore creation time. The pages that compose a cluster
  are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster
  worth of pages, cluster 1 is the second grouping of pages, etc. A cluster is typically 1MiB in size, or 256 pages.
* **Blob**: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application
  and persist across power failures and reboots. Applications use a Blobstore provided identifier to access a particular blob.
  Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also
  store metadata in the form of key/value pairs with each blob, which we'll refer to as xattrs (extended attributes).
* **Blobstore**: An SSD which has been initialized by a Blobstore-based application is referred to as "a Blobstore." A
  Blobstore owns the entire underlying device, which is made up of a private Blobstore metadata region and the collection of
  blobs as managed by the application.

```text
+-----------------------------------------------------------------+
|                              Blob                               |
| +-----------------------------+ +-----------------------------+ |
| |           Cluster           | |           Cluster           | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| | |Page| |Page| |Page| |Page| | | |Page| |Page| |Page| |Page| | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| +-----------------------------+ +-----------------------------+ |
+-----------------------------------------------------------------+
```

### Atomicity

For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic
operations of at least one page in size. Atomicity here can refer to multiple operations:

* **Data Writes**: For the case of data writes, the unit of atomicity is one page. Therefore if a write operation of
  greater than one page is underway and the system suffers a power failure, the data on media will be consistent at a page
  size granularity (if a single page was in the middle of being updated when power was lost, then following power
  restoration the data at that page location will be as it was prior to the start of the write operation).
* **Blob Metadata Updates**: Each blob has its own set of metadata (xattrs, size, etc). For performance reasons, a copy of
  this metadata is kept in RAM and only synchronized with the on-disk version when the application makes an explicit call to
  do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not persisted until the call to
  synchronize it (covered later), which is itself performed atomically.
* **Blobstore Metadata Updates**: Blobstore itself has its own metadata which, like per blob metadata, has a copy in both
  RAM and on-disk. Unlike the per blob metadata, however, the Blobstore metadata region is not made consistent via a blob
  synchronization call; it is only synchronized when the Blobstore is properly unloaded via API. Therefore, if the Blobstore
  metadata is updated (blob creation, deletion, resize, etc.) and the Blobstore is not unloaded properly, it will need to
  perform some extra steps the next time it is loaded, which will take a bit more time than a clean shutdown would have,
  but there will be no inconsistencies.

### Callbacks

Blobstore is callback driven; in the event that any Blobstore API is unable to make forward progress it will
not block, but instead return control at that point and invoke the callback function provided in the API, along with
its arguments, when the original call is completed. The callback will be made on the same thread that the call was made
from; more on threads later. Some APIs, however, take no callback arguments; in these cases the calls are fully
synchronous. Examples of asynchronous calls that utilize callbacks include those that involve disk IO, where some amount
of polling is required before the IO is completed.

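
To illustrate the pattern, below is a minimal, hedged sketch of issuing an asynchronous open and resuming work in its
completion callback; `spdk_bs_open_blob()` is the public API call, while the context structure and function names around
it are hypothetical.

~~~{.c}
#include "spdk/blob.h"

struct my_context {
    struct spdk_blob *blob;
};

/* Hypothetical completion callback: invoked later, on the same thread that made
 * the original call, once the open has actually finished. */
static void
open_complete(void *cb_arg, struct spdk_blob *blob, int bserrno)
{
    struct my_context *ctx = cb_arg;

    if (bserrno != 0) {
        /* handle the error */
        return;
    }
    ctx->blob = blob;
    /* continue with the next step of the application's state machine */
}

static void
start_open(struct spdk_blob_store *bs, spdk_blob_id blobid, struct my_context *ctx)
{
    /* Returns immediately; open_complete() fires when the open completes. */
    spdk_bs_open_blob(bs, blobid, open_complete, ctx);
}
~~~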

### Backend Support

Blobstore requires a backing storage device that can be integrated using the `bdev` layer, or by directly integrating a
device driver with Blobstore. Blobstore performs operations on a backing block device by calling function pointers
supplied to it at initialization time. For convenience, an implementation of these function pointers that routes I/O
to the bdev layer is available in `bdev_blob.c`. Alternatively, the SPDK NVMe driver, for example, may be integrated
directly, bypassing a small amount of `bdev` layer overhead. These options will be discussed further in the upcoming
section on examples.

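
For example, the hedged sketch below (callback bodies and error handling are illustrative) creates a blobstore device on
top of an existing bdev using the `bdev_blob` glue and then initializes a Blobstore on it:

~~~{.c}
#include "spdk/bdev.h"
#include "spdk/blob.h"
#include "spdk/blob_bdev.h"

static void
bs_init_complete(void *cb_arg, struct spdk_blob_store *bs, int bserrno)
{
    /* The Blobstore is ready for use when bserrno is 0. */
}

static void
base_bdev_event_cb(enum spdk_bdev_event_type type, struct spdk_bdev *bdev, void *event_ctx)
{
    /* React to events (e.g. hot remove) on the underlying bdev. */
}

static void
init_blobstore(const char *bdev_name)
{
    struct spdk_bs_dev *bs_dev = NULL;
    int rc;

    /* Route Blobstore I/O through the bdev layer (the bdev_blob.c implementation). */
    rc = spdk_bdev_create_bs_dev_ext(bdev_name, base_bdev_event_cb, NULL, &bs_dev);
    if (rc != 0) {
        return;
    }
    /* Passing NULL selects the default initialization options. */
    spdk_bs_init(bs_dev, NULL, bs_init_complete, NULL);
}
~~~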

### Metadata Operations

Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single
thread to avoid taking locks on the in-memory data structures that maintain data on the layout and definitions of blobs
(along with other data). In Blobstore this is implemented as `the metadata thread`, which is defined to be the thread on
which the application makes metadata related calls. It is up to the application to set up a separate thread to make these
calls on and to assure that it does not mix relevant IO operations with metadata operations, even if they are on separate
threads. This will be discussed further in the Design Considerations section.

### Threads

An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios.
The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a
single core. In this case, the single thread would be used to submit both metadata operations and IO operations, and
it would be up to the application to assure that only one metadata operation is issued at a time and is not intermingled
with affected IO operations.

### Channels

Channels are an SPDK-wide abstraction; with Blobstore, the best way to think about them is that they are
required in order to do IO. The application will perform IO to the channel, and channels are best thought of as being
associated 1:1 with a thread.

With external snapshots (see @ref blob_pg_esnap_and_esnap_clone), a read from a blob may lead to
reading from the device containing the blobstore or an external snapshot device. To support this,
each blobstore IO channel maintains a tree of channels to be used when reading from external
snapshot devices.

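
A hedged sketch of the channel pattern on a single SPDK thread follows; the function names other than the Blobstore API
calls are hypothetical, and a real multi-threaded application would keep one channel per thread rather than a single
global one.

~~~{.c}
#include "spdk/blob.h"
#include "spdk/thread.h"

/* One channel per IO thread; a single global is shown here only for brevity. */
static struct spdk_io_channel *g_channel;

static void
write_complete(void *cb_arg, int bserrno)
{
    /* IO finished; bserrno is 0 on success. */
}

/* Called once on the thread that will perform blob IO. */
static void
io_setup(struct spdk_blob_store *bs)
{
    g_channel = spdk_bs_alloc_io_channel(bs);
}

/* Write one io unit (page) of data at offset 0 of the blob. */
static void
do_write(struct spdk_blob *blob, void *payload)
{
    spdk_blob_io_write(blob, g_channel, payload, 0, 1, write_complete, NULL);
}

/* Called on the same thread when it is completely done doing IO. */
static void
io_teardown(void)
{
    spdk_bs_free_io_channel(g_channel);
}
~~~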

### Blob Identifiers

When an application creates a blob, it does not provide a name as is the case with many other similar
storage systems; instead, the Blobstore returns a unique identifier that the application needs to use in
subsequent API calls to perform operations on the Blobstore.

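
A hedged sketch of creating a blob and capturing the returned identifier (the context structure and callback are
illustrative):

~~~{.c}
#include "spdk/blob.h"

struct create_ctx {
    spdk_blob_id blobid;    /* recorded by the application for later opens */
};

static void
create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
    struct create_ctx *ctx = cb_arg;

    if (bserrno == 0) {
        /* This identifier is the only handle to the blob; applications typically
         * record it somewhere durable, for example in an xattr of a super blob. */
        ctx->blobid = blobid;
    }
}

static void
create_blob(struct spdk_blob_store *bs, struct create_ctx *ctx)
{
    spdk_bs_create_blob(bs, create_complete, ctx);
}
~~~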

## Design Considerations {#blob_pg_design}

### Initialization Options

When the Blobstore is initialized, there are multiple configuration options to consider; a short sketch of
overriding them follows the list below. The options and their defaults are:

* **Cluster Size**: By default, this value is 1MiB. The cluster size is required to be a multiple of the page size and
  should be selected based on the application's usage model in terms of allocation. Recall that blobs are made up of
  clusters, so when a blob is allocated/deallocated or changes in size, disk LBAs will be manipulated in groups of cluster
  size. If the application is expecting to deal with mainly very large (always multiple GB) blobs then it may make sense
  to change the cluster size to 1GB, for example.
* **Number of Metadata Pages**: By default, Blobstore will assume there can be as many clusters as there are metadata
  pages, which is the worst case scenario in terms of metadata usage. This can be overridden here; however, the space
  savings are not significant.
* **Maximum Simultaneous Metadata Operations**: Determines how many internally pre-allocated memory structures are set
  aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
* **Maximum Simultaneous Operations Per Channel**: Determines how many internally pre-allocated memory structures are set
  aside for channel operations. Changes to this value would be application dependent and best determined by a knowledge
  of the typical usage model, an understanding of the types of SSDs being used, and empirical data. The default is 512.
* **Blobstore Type**: This field is a character array to be used by applications that need to identify whether the
  Blobstore found here is appropriate to claim or not. The default is NULL and unless the application is being deployed in
  an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore,
  there is no need to set this value. It can, however, be set to any valid set of characters.
* **External Snapshot Device Creation Callback**: If the blobstore supports external snapshots, this function will be
  called as a blob that clones an external snapshot (an "esnap clone") is opened, so that the blobstore consumer can load
  the external snapshot and register a blobstore device that will satisfy read requests. See
  @ref blob_pg_esnap_and_esnap_clone.

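
The hedged sketch below shows how a few of these options might be overridden before initialization. The field names
match the public `blob.h` header at the time of writing, but the exact `spdk_bs_opts_init()` signature has varied across
SPDK releases, so verify against your headers.

~~~{.c}
#include <stdio.h>

#include "spdk/blob.h"

static void
bs_init_complete(void *cb_arg, struct spdk_blob_store *bs, int bserrno)
{
    /* The Blobstore handle in bs is valid here when bserrno is 0. */
}

static void
init_with_opts(struct spdk_bs_dev *bs_dev)
{
    struct spdk_bs_opts opts;

    spdk_bs_opts_init(&opts, sizeof(opts));
    opts.cluster_sz = 4 * 1024 * 1024;   /* 4MiB clusters instead of the 1MiB default */
    opts.max_channel_ops = 1024;         /* allow more concurrent operations per channel */
    /* Tag the Blobstore so this application can recognize it on a later load. */
    snprintf(opts.bstype.bstype, sizeof(opts.bstype.bstype), "my_app");
    spdk_bs_init(bs_dev, &opts, bs_init_complete, NULL);
}
~~~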

### Sub-page Sized Operations

Blobstore is only capable of doing page sized read/write operations. If the application
requires finer granularity, it will have to accommodate that itself.

### Threads

As mentioned earlier, Blobstore can share a single thread with an application or the application
can define any number of threads, within resource constraints, that make sense. The basic considerations that must be
followed are:

* Metadata operations (API with MD in the name) should be isolated from each other as there is no internal locking on the
  memory structures affected by these APIs.
* Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO would be one that is
  reading/writing to an area of a blob that a metadata operation is deallocating).
* Asynchronous callbacks will always take place on the calling thread.
* No assumptions about IO ordering can be made regardless of how many or which threads were involved in the issuing.

### Data Buffer Memory

As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated
with the SPDK API.

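
For example, a DMA-safe payload buffer might be allocated as in the hedged sketch below (the size and alignment values
are illustrative):

~~~{.c}
#include "spdk/env.h"

static void *
alloc_payload(void)
{
    /* 4KiB buffer, 4KiB aligned, DMA-able; release it later with spdk_free(). */
    return spdk_malloc(0x1000, 0x1000, NULL, SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
}
~~~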

### Error Handling

Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values
indicate an error. Synchronous calls will typically return an error value if applicable.

### Asynchronous API

Asynchronous calls return control not immediately, but at the point in execution where no
more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of
an asynchronous call until the callback has completed.

### Xattrs

Setting and removing xattrs in Blobstore is a metadata operation; xattrs are stored in per blob metadata.
Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Making the persistence of
per blob metadata a separate step allows applications to perform batches of xattr updates, for example, with only one
more expensive call to synchronize and persist the values.

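
A hedged sketch of batching two xattr updates and persisting both with a single synchronization call (the names, values
and callback are illustrative):

~~~{.c}
#include <string.h>

#include "spdk/blob.h"

static void
sync_complete(void *cb_arg, int bserrno)
{
    /* The xattrs set below are persistent once bserrno is 0. */
}

static void
update_xattrs(struct spdk_blob *blob)
{
    const char *name = "my_blob_name";     /* illustrative values */
    uint64_t generation = 42;

    /* These calls only update the in-memory copy of the blob's metadata. */
    spdk_blob_set_xattr(blob, "name", name, strlen(name) + 1);
    spdk_blob_set_xattr(blob, "generation", &generation, sizeof(generation));

    /* One metadata write persists both updates atomically. */
    spdk_blob_sync_md(blob, sync_complete, NULL);
}
~~~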

### Synchronizing Metadata

As described earlier, there are two types of metadata in Blobstore: per blob metadata and one global
metadata region for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global
metadata will be inconsistent during run-time and is only synchronized on proper shutdown. The implication, however, of
an improper shutdown is only a performance penalty on the next startup as the global metadata will need to be rebuilt
based on a parsing of the per blob metadata. For consistent start times, it is important to always close down the Blobstore
properly via API.

### Iterating Blobs

Multiple examples of how to iterate through the blobs are included in the sample code and tools.
It is worth noting, however, that when walking through the existing blobs via the iter API, if your application finds the
blob it is looking for, it will either need to explicitly close it (because it was opened internally by the Blobstore) or
complete walking the full list.

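
A hedged sketch of the iteration pattern (the callback is illustrative; in the current sources `-ENOENT` signals the end
of the list, but verify against your release):

~~~{.c}
#include <errno.h>

#include "spdk/blob.h"

static void
iter_complete(void *cb_arg, struct spdk_blob *blob, int bserrno)
{
    struct spdk_blob_store *bs = cb_arg;

    if (bserrno == -ENOENT) {
        /* Reached the end of the blob list. */
        return;
    }
    if (bserrno != 0) {
        /* handle the error */
        return;
    }

    /* Inspect the blob here. It was opened internally by the iterator, so either
     * continue the walk (which closes it) or close it explicitly. */
    spdk_bs_iter_next(bs, blob, iter_complete, bs);
}

static void
walk_blobs(struct spdk_blob_store *bs)
{
    spdk_bs_iter_first(bs, iter_complete, bs);
}
~~~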

### The Super Blob

The super blob is simply a single blob ID that can be stored as part of the global metadata to act
as sort of a "root" blob. The application may choose to use this blob to store any information that it needs or finds
relevant in understanding any kind of structure for what is on the Blobstore.

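
A hedged sketch of recording and later retrieving the super blob ID (the callbacks are illustrative):

~~~{.c}
#include "spdk/blob.h"

static void
set_super_complete(void *cb_arg, int bserrno)
{
    /* The super blob ID is now recorded in the Blobstore's global metadata. */
}

static void
get_super_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
    /* On a later load, blobid identifies the application's "root" blob. */
}

static void
record_root_blob(struct spdk_blob_store *bs, spdk_blob_id root_id)
{
    spdk_bs_set_super(bs, root_id, set_super_complete, NULL);
}

static void
find_root_blob(struct spdk_blob_store *bs)
{
    spdk_bs_get_super(bs, get_super_complete, NULL);
}
~~~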

## Examples {#blob_pg_examples}

There are multiple examples of Blobstore usage in the [repo](https://github.com/spdk/spdk):

* **Hello World**: Actually named `hello_blob.c`, this is a very basic example of a single threaded application that
  does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses
  a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end
  is a `bdev` module, thus this example uses not only the SPDK Framework but the `bdev` layer as well.

* **CLI**: The `blobcli.c` example is a command line utility intended not only to serve as example code but also as a test
  and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the
  SPDK Framework and the `bdev` layer but offers multiple modes of operation to accomplish some real-world tasks. In
  command mode, it accepts single-shot commands, which can be a little time consuming if there are many commands to
  get through, as each one will take a few seconds waiting for DPDK initialization. It therefore has a shell mode that
  allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore using simple commands
  that include the ability to import/export blobs from/to regular files. Lastly, there is a scripting mode to automate
  a series of tasks, again handy for development and/or test type activities.

## Configuration {#blob_pg_config}

Blobstore configuration options are described in the initialization options section under @ref blob_pg_design.

## Component Detail {#blob_pg_component}

The information in this section is not necessarily relevant to designing an application for use with Blobstore, but
understanding a little more about the internals may be interesting and is also included here for those wanting to
contribute to the Blobstore effort itself.

### Media Format

The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such
that cluster 0 begins at the first logical block.

```text
LBA 0                                   LBA N
+-----------+-----------+-----+-----------+
| Cluster 0 | Cluster 1 | ... | Cluster N |
+-----------+-----------+-----+-----------+
```

Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:

```text
+--------+-------------------+
| Page 0 | Page 1 ... Page N |
+--------+-------------------+
| Super  |  Metadata Region  |
| Block  |                   |
+--------+-------------------+
```

The super block is a single page located at the beginning of the partition. It contains basic information about
the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer
to the latest source code for complete structural details of the super block and metadata region.

Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages
form a linked list. The first page in the list will be written in place on update, while all other pages will
be written to fresh locations. This requires the backing device to support an atomic write size greater than
or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.

### Blob cluster layout {#blob_pg_cluster_layout}

Each blob is an ordered list of clusters, where the starting LBA of a cluster is called an extent. A blob can be
thin provisioned, resulting in no extent for some of its clusters. When the first write operation occurs
to an unallocated cluster, a new extent is chosen. This information is stored both in RAM and on disk.

There are two on-disk extent representations, depending on the `use_extent_table` (default: true) option used
when creating a blob.

* **use_extent_table=true**: The EXTENT_PAGE descriptor is not part of the linked list of pages. It contains extents
  that are not run-length encoded. Each extent page is referenced by an EXTENT_TABLE descriptor, which is serialized
  as part of the linked list of pages. The extent table run-length encodes all unallocated extent pages.
  When the extent page was previously allocated, every new cluster allocation updates only that single extent page;
  otherwise it additionally incurs serializing the whole linked list of pages for the blob.

* **use_extent_table=false**: The EXTENT_RLE descriptor is serialized as part of the linked list of pages.
  Extents pointing to contiguous LBAs are run-length encoded, including unallocated extents represented by 0.
  Every new cluster allocation incurs serializing the whole linked list of pages for the blob.

### Thin Blobs, Snapshots, and Clones

Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
allocated to a blob it is considered owned by that blob and that particular blob's metadata
maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
during snapshot operations described later in @ref blob_pg_snapshots.

Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends
on whether the operation targets blocks that are backed by a cluster owned by the blob or not.

* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
  appropriate cluster.
* **read from other blocks**: The read is passed on to the blob's *back device* and the back
  device services the read. The back device may be another blob or it may be a zeroes device.
* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
  appropriate cluster.
* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
  is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
  for details.

External snapshots allow some external data source to act as a snapshot. This allows clones to be
created of data that resides outside of the blobstore containing the clone.

#### Thin Provisioning {#blob_pg_thin_provisioning}

As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
zeroes.

When a write targets a block of a thin provisioned blob that does not have an allocated cluster, the
following steps are performed:

1. Allocate a cluster.
2. Update blob metadata.
3. Perform the write.

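
A hedged sketch of creating a thin provisioned blob in the first place (the option field names match the public `blob.h`
header at the time of writing; the callback is illustrative):

~~~{.c}
#include "spdk/blob.h"

static void
create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
    /* The blob exists but owns no clusters yet; they are allocated as writes occur. */
}

static void
create_thin_blob(struct spdk_blob_store *bs, uint64_t num_clusters)
{
    struct spdk_blob_opts opts;

    spdk_blob_opts_init(&opts, sizeof(opts));
    opts.num_clusters = num_clusters;   /* logical size of the blob, in clusters */
    opts.thin_provision = true;         /* back unwritten clusters with the zeroes device */
    spdk_bs_create_blob_ext(bs, &opts, create_complete, NULL);
}
~~~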

#### Snapshots and Clones {#blob_pg_snapshots}

A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
the covers this results in a chain of snapshots that are clones of the previous snapshot.

When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
blob. That is:

| Step | Action                         | State                                             |
| ---- | ------------------------------ | ------------------------------------------------- |
| 1    | Create blob1                   | `blob1 (rw)`                                      |
| 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`                       |
| 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`                       |
| 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) ---> blob2 (ro)`       |

Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
becomes blob3.

It is important to understand the chain above when considering strategies to use a golden image from
which many clones are made. The IO path is more efficient when one snapshot is cloned many times than
when a new snapshot is created for every clone. The following illustrates the difference.

Using a single snapshot means the data originally referenced by the golden image is always one hop
away.

```text
create golden                           golden --> golden-snap
snapshot golden as golden-snap                     ^ ^ ^
clone golden-snap as clone1              clone1 ---+ | |
clone golden-snap as clone2              clone2 -----+ |
clone golden-snap as clone3              clone3 -------+
```

Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from
clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the
blocks originally allocated to golden).

```text
create golden
snapshot golden as snap1                golden --> snap3 -----> snap2 ----> snap1
clone snap1 as clone1                   clone3----/   clone2 --/  clone1 --/
snapshot golden as snap2
clone snap2 as clone2
snapshot golden as snap3
clone snap3 as clone3
```

A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
freed, depending on whether the clone already owns a cluster for a particular block range.

Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
can serve as the snapshot for future clones.

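
A hedged sketch of the snapshot and clone calls used to build a golden-image chain like the one above (xattr options are
omitted by passing NULL; the callbacks and error handling are illustrative):

~~~{.c}
#include "spdk/blob.h"

static void
clone_complete(void *cb_arg, spdk_blob_id clone_id, int bserrno)
{
    /* clone_id names a writable clone backed by the snapshot. */
}

static void
snapshot_complete(void *cb_arg, spdk_blob_id snapshot_id, int bserrno)
{
    struct spdk_blob_store *bs = cb_arg;

    if (bserrno != 0) {
        return;
    }
    /* The golden image can now be cloned any number of times. */
    spdk_bs_create_clone(bs, snapshot_id, NULL, clone_complete, NULL);
}

static void
make_golden_snapshot(struct spdk_blob_store *bs, spdk_blob_id golden_id)
{
    /* golden_id becomes a clone of the new read-only snapshot. */
    spdk_bs_create_snapshot(bs, golden_id, NULL, snapshot_complete, bs);
}
~~~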

#### Inflating and Decoupling Clones

A clone can remove its dependence on a snapshot with the following operations (a sketch of the corresponding
calls follows the list):

1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
   allocated clusters. The blob becomes a thick provisioned blob.
2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
   allocated clusters. If the clone's back device snapshot was itself a clone of another
   snapshot, the clone remains a clone but is now a clone of a different snapshot.
3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
   clone remains a clone, but is now a clone of a different snapshot.

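
As referenced above, here is a hedged sketch of the first two operations. Both take an IO channel because data may be
copied; the function names are taken from the public `blob.h` header at the time of writing, so verify them against your
release.

~~~{.c}
#include "spdk/blob.h"

static void
op_complete(void *cb_arg, int bserrno)
{
    /* The clone's dependence on its snapshot has been removed or reduced. */
}

/* Option 1: inflate - copy all backing data so the blob becomes thick provisioned. */
static void
inflate_clone(struct spdk_blob_store *bs, struct spdk_io_channel *channel, spdk_blob_id clone_id)
{
    spdk_bs_inflate_blob(bs, channel, clone_id, op_complete, NULL);
}

/* Option 2: decouple - copy only clusters backed by the immediate parent snapshot. */
static void
decouple_clone(struct spdk_blob_store *bs, struct spdk_io_channel *channel, spdk_blob_id clone_id)
{
    spdk_bs_blob_decouple_parent(bs, channel, clone_id, op_complete, NULL);
}
~~~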

#### External Snapshots and Esnap Clones {#blob_pg_esnap_and_esnap_clone}

A blobstore that is loaded with the `esnap_bs_dev_create` callback defined will support external
snapshots (esnaps). An external snapshot is not useful on its own: it needs to be cloned by a blob.
A clone of an external snapshot is referred to as an *esnap clone*. An esnap clone supports IO and
other operations just like any other clone.

An esnap clone can be recognized in various ways (a short sketch follows the list):

* **On disk**: the blob metadata has the `SPDK_BLOB_EXTERNAL_SNAPSHOT` (0x8) bit set in
  `invalid_flags`, and an internal XATTR with name `BLOB_EXTERNAL_SNAPSHOT_ID` ("EXTSNAP") exists.
* **In memory**: The `spdk_blob` structure contains the metadata read from disk, `blob->parent_id`
  is set to `SPDK_BLOBID_EXTERNAL_SNAPSHOT`, and `blob->back_bs_dev` references a blobstore device
  which is not a blob in the same blobstore nor a zeroes device.

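
In application code, the in-memory checks can be as simple as the hedged sketch below; the helper functions exist in
recent SPDK releases, but verify them against your headers.

~~~{.c}
#include "spdk/blob.h"

/* Classify how a blob's unallocated or unwritten blocks are backed. */
static void
classify_blob(struct spdk_blob *blob)
{
    if (spdk_blob_is_esnap_clone(blob)) {
        /* The back device is an external snapshot. */
    } else if (spdk_blob_is_clone(blob)) {
        /* The back device is another blob (a snapshot) in this blobstore. */
    } else if (spdk_blob_is_thin_provisioned(blob)) {
        /* The back device is the zeroes device. */
    }
}
~~~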

#### Shallow Copy {#blob_shallow_copy}

A read-only blob can be copied over a blobstore device in such a way that only the clusters
allocated to the blob are written to the device. The device must have a size equal to or greater
than the blob's size, and the blobstore's block size must be an integer multiple of the device's
block size. This functionality can be used to recreate the entire snapshot stack of a blob on a
different blobstore.

#### Change the parent of a blob {#blob_reparent}

The parent of a thin provisioned blob can be changed, making the blob a clone of a snapshot in the
same blobstore or a clone of an external snapshot. The previous parent of the blob can be a snapshot,
an external snapshot, or none.

If the new parent of the blob is a snapshot of the same blobstore, the blob and the snapshot must have the same number
of clusters.

If the new parent of the blob is an external snapshot, the size of the esnap must be an integer multiple of
the blob's cluster size.

#### Copy-on-write {#blob_pg_copy_on_write}

A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
size. Typical copy-on-write involves the following steps:

1. Allocate a cluster.
2. Allocate a cluster-sized buffer into which data can be read.
3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
4. Write from the cluster-sized buffer into the newly allocated cluster.
5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
   involves at least one page-sized write.
6. Write the new data to the newly allocated and copied cluster.

If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
offloaded to the device. Neither of these optimizations is available when the back device is an
external snapshot.


### Sequences and Batches

Internally, Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either
a serial fashion or in parallel, respectively. Both are defined using the following structure:

~~~{.sh}
struct spdk_bs_request_set;
~~~

These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups
of IO. They are an internal construct only and are pre-allocated on a per channel basis (channels were discussed
earlier). They are removed from a channel-associated linked list when the set (sequence or batch) is started and
then returned to the list when completed.

Each request set maintains a reference to a `channel` and a `back_channel`. The `channel` is used
for performing IO on the blobstore device. The `back_channel` is used for performing IO on the
blob's back device, `blob->back_bs_dev`. For blobs that are not esnap clones, `channel` and
`back_channel` reference an IO channel used with the device that contains the blobstore. For blobs
that are esnap clones, `channel` is the same as with any other blob and `back_channel` is an IO
channel for the external snapshot device.


### Key Internal Structures

`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones
are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines
the public API is `blob.h`.

~~~{.sh}
struct spdk_blob
~~~
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two
copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written
to disk.

~~~{.sh}
struct spdk_blob_mut_data
~~~
This is a per blob structure, included in the `struct spdk_blob` structure, that actually defines the blob itself. It has
the specific information on the size and makeup of the blob (i.e. how many clusters are allocated for this blob and which
ones).

~~~{.sh}
struct spdk_blob_store
~~~
This is the main in-memory structure for the entire Blobstore. It defines the global on disk metadata region and maintains
information relevant to the entire system - initialization options such as cluster size, etc.

~~~{.sh}
struct spdk_bs_super_block
~~~
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore
structure just discussed along with other elements one would expect to see here such as signature, version, checksum, etc.


### Code Layout and Common Conventions

In general, `blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For
example,

~~~{.sh}
/* START spdk_bs_md_delete_blob */
< relevant functions to accomplish the deletion of a blob >
/* END spdk_bs_md_delete_blob */
~~~

And for the most part the following conventions are followed throughout:

* functions beginning with an underscore are called internally only
* functions or variables with the letters `cpl` are related to set or callback completions
545