# Blobstore Programmer's Guide {#blob}

## In this document {#blob_pg_toc}

* @ref blob_pg_audience
* @ref blob_pg_intro
* @ref blob_pg_theory
* @ref blob_pg_design
* @ref blob_pg_examples
* @ref blob_pg_config
* @ref blob_pg_component

## Target Audience {#blob_pg_audience}

The programmer's guide is intended for developers authoring applications that utilize the SPDK Blobstore. It is
intended to supplement the source code in providing an overall understanding of how to integrate Blobstore into
an application as well as provide some high level insight into how Blobstore works behind the scenes. It is not
intended to serve as a design document or an API reference, although in some cases source code snippets and high level
sequences will be discussed; for the latest source code, refer to the [repo](https://github.com/spdk).

## Introduction {#blob_pg_intro}

Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system
backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services
can be local databases or key/value stores (MySQL, RocksDB), dedicated appliances (SAN, NAS), or
distributed storage systems (e.g. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however,
and it is intentionally not POSIX compliant. To avoid confusion, we avoid references to files or objects and instead
use the term 'blob'. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to
groups of blocks on a block device called 'blobs'. Blobs are typically large, measured in at least hundreds of
kilobytes, and are always a multiple of the underlying block size.

The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random
reads and writes, with no required background garbage collection. However, in practice the design will run well on
NAND too.

## Theory of Operation {#blob_pg_theory}

### Abstractions

The Blobstore defines a hierarchy of storage abstractions as follows.

* **Logical Block**: Logical blocks are exposed by the disk itself and are numbered from 0 to N-1, where N is the
  number of blocks in the disk. A logical block is typically either 512B or 4KiB.
* **Page**: A page is defined to be a fixed number of logical blocks defined at Blobstore creation time. The logical
  blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk such
  that the first page worth of blocks is page 0, the second page is page 1, etc. A page is typically 4KiB in size,
  so this is either 8 or 1 logical blocks in practice. The SSD must be able to perform atomic reads and writes of
  at least the page size.
* **Cluster**: A cluster is a fixed number of pages defined at Blobstore creation time. The pages that compose a cluster
  are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster
  worth of pages, cluster 1 is the second grouping of pages, etc. A cluster is typically 1MiB in size, or 256 pages.
* **Blob**: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application
  and persist across power failures and reboots. Applications use a Blobstore-provided identifier to access a particular blob.
  Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also
  store metadata in the form of key/value pairs with each blob, which we'll refer to as xattrs (extended attributes).
* **Blobstore**: An SSD which has been initialized by a Blobstore-based application is referred to as "a Blobstore." A
  Blobstore owns the entire underlying device, which is made up of a private Blobstore metadata region and the collection of
  blobs as managed by the application.

```text
+-----------------------------------------------------------------+
|                              Blob                               |
| +-----------------------------+ +-----------------------------+ |
| |           Cluster           | |           Cluster           | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| | |Page| |Page| |Page| |Page| | | |Page| |Page| |Page| |Page| | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| +-----------------------------+ +-----------------------------+ |
+-----------------------------------------------------------------+
```
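
The relationship between logical blocks, pages, and clusters comes down to simple integer arithmetic. The following is a minimal sketch of that arithmetic only; the `bs_geometry` structure and the helper names are hypothetical and are not part of the Blobstore API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry description; field names are illustrative only. */
struct bs_geometry {
    uint32_t block_size;   /* bytes per logical block, e.g. 512 or 4096 */
    uint32_t page_size;    /* bytes per page, e.g. 4096 */
    uint32_t cluster_size; /* bytes per cluster, e.g. 1 MiB */
};

/* Number of logical blocks that make up one page. */
static uint32_t blocks_per_page(const struct bs_geometry *g)
{
    assert(g->page_size % g->block_size == 0);
    return g->page_size / g->block_size;
}

/* Number of pages that make up one cluster. */
static uint32_t pages_per_cluster(const struct bs_geometry *g)
{
    assert(g->cluster_size % g->page_size == 0);
    return g->cluster_size / g->page_size;
}

/* Page number, counted from the start of the disk, that contains an LBA. */
static uint64_t lba_to_page(const struct bs_geometry *g, uint64_t lba)
{
    return lba / blocks_per_page(g);
}

/* Cluster number, counted from the start of the disk, that contains a page. */
static uint64_t page_to_cluster(const struct bs_geometry *g, uint64_t page)
{
    return page / pages_per_cluster(g);
}
```

With 512B blocks, 4KiB pages, and 1MiB clusters, this gives 8 blocks per page and 256 pages per cluster, matching the typical sizes listed above.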

### Atomicity

For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic
operations of at least one page size. Atomicity here can refer to multiple operations:

* **Data Writes**: For the case of data writes, the unit of atomicity is one page. Therefore if a write operation of
  greater than one page is underway and the system suffers a power failure, the data on media will be consistent at a page
  size granularity (if a single page was in the middle of being updated when power was lost, then following power
  restoration the data at that page location will be as it was prior to the start of the write operation).
* **Blob Metadata Updates**: Each blob has its own set of metadata (xattrs, size, etc). For performance reasons, a copy of
  this metadata is kept in RAM and only synchronized with the on-disk version when the application makes an explicit call to
  do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not consistent on disk until the
  call to synchronize it (covered later) is made; that call, however, is performed atomically.
* **Blobstore Metadata Updates**: Blobstore itself has its own metadata which, like per blob metadata, has a copy in both
  RAM and on-disk. Unlike the per blob metadata, however, the Blobstore metadata region is not made consistent via a blob
  synchronization call; it is only synchronized when the Blobstore is properly unloaded via API. Therefore, if the Blobstore
  metadata is updated (blob creation, deletion, resize, etc.) and the Blobstore is not unloaded properly, it will need to
  perform some extra steps the next time it is loaded, which will take a bit more time than it would have after a clean
  shutdown, but there will be no inconsistencies.

### Callbacks

Blobstore is callback driven; in the event that any Blobstore API is unable to make forward progress it will
not block but instead return control at that point and make a call to the callback function provided in the API, along with
arguments, when the original call is completed. The callback will be made on the same thread that the call was made from (more on
threads later). Some APIs, however, offer no callback arguments; in these cases the calls are fully synchronous. Examples of
asynchronous calls that utilize callbacks include those that involve disk IO, where some amount of polling
is required before the IO is completed.
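
The callback-driven style described above can be illustrated with a toy model. This is only a sketch: `toy_async_read`, `toy_poll`, and the pending queue are hypothetical stand-ins, not Blobstore functions. The callback shape `(void *cb_arg, int bserrno)` is modelled loosely on the completion callbacks used by Blobstore.

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback: cb_arg is caller context, bserrno is 0 on success. */
typedef void (*op_complete_fn)(void *cb_arg, int bserrno);

/* One outstanding operation in this toy model. */
struct pending_op {
    op_complete_fn cb_fn;
    void *cb_arg;
    int result;
};

#define MAX_PENDING 8
static struct pending_op g_pending[MAX_PENDING];
static size_t g_num_pending;

/* Hypothetical asynchronous call: records the request and returns at once. */
static int toy_async_read(op_complete_fn cb_fn, void *cb_arg)
{
    if (g_num_pending == MAX_PENDING) {
        return -1; /* no free slot; the caller should retry later */
    }
    g_pending[g_num_pending].cb_fn = cb_fn;
    g_pending[g_num_pending].cb_arg = cb_arg;
    g_pending[g_num_pending].result = 0;
    g_num_pending++;
    return 0;
}

/* Poller run on the submitting thread: completes the queued operations. */
static size_t toy_poll(void)
{
    struct pending_op batch[MAX_PENDING];
    size_t n = g_num_pending;

    /* Take a private copy so callbacks may safely submit new operations. */
    for (size_t i = 0; i < n; i++) {
        batch[i] = g_pending[i];
    }
    g_num_pending = 0;
    for (size_t i = 0; i < n; i++) {
        batch[i].cb_fn(batch[i].cb_arg, batch[i].result);
    }
    return n;
}

/* Example callback that counts successful completions. */
static void count_done(void *cb_arg, int bserrno)
{
    if (bserrno == 0) {
        (*(int *)cb_arg)++;
    }
}
```

The point of the sketch is that the submitting call returns before the work is finished, and the result only becomes visible once the poller has invoked the callback on the same thread.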

### Backend Support

Blobstore requires a backing storage device that can be integrated using the `bdev` layer, or by directly integrating a
device driver with Blobstore. The blobstore performs operations on a backing block device by calling function pointers
supplied to it at initialization time. For convenience, an implementation of these function pointers that routes I/O
to the bdev layer is available in `bdev_blob.c`. Alternatively, for example, the SPDK NVMe driver may be directly integrated,
bypassing a small amount of `bdev` layer overhead. These options will be discussed further in the upcoming section on examples.

### Metadata Operations

Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single
thread to avoid taking locks on in-memory data structures that maintain data on the layout and definitions of blobs (along
with other data). In Blobstore this is implemented as `the metadata thread` and is defined to be the thread on which the
application makes metadata related calls. It is up to the application to set up a separate thread to make these calls on
and to ensure that it does not mix relevant IO operations with metadata operations even if they are on separate threads.
This will be discussed further in the Design Considerations section.

### Threads

An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios.
The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a
single core. In this case, the single thread would be used to submit both metadata operations and IO operations, and
it would be up to the application to ensure that only one metadata operation is issued at a time and not intermingled with
affected IO operations.

### Channels

Channels are an SPDK-wide abstraction and with Blobstore the best way to think about them is that they are
required in order to do IO. The application will perform IO to the channel and channels are best thought of as being
associated 1:1 with a thread.

With external snapshots (see @ref blob_pg_esnap_and_esnap_clone), a read from a blob may lead to
reading from the device containing the blobstore or an external snapshot device. To support this,
each blobstore IO channel maintains a tree of channels to be used when reading from external
snapshot devices.

### Blob Identifiers

When an application creates a blob, it does not provide a name as is the case with many other similar
storage systems. Instead, the Blobstore returns a unique identifier that the application must use in subsequent API
calls to perform operations on the blob.

## Design Considerations {#blob_pg_design}

### Initialization Options

When the Blobstore is initialized, there are multiple configuration options to consider. The
options and their defaults are:

* **Cluster Size**: By default, this value is 1MiB. The cluster size is required to be a multiple of page size and should be
  selected based on the application’s usage model in terms of allocation. Recall that blobs are made up of clusters so when
  a blob is allocated/deallocated or changes in size, disk LBAs will be manipulated in groups of cluster size. If the
  application is expecting to deal with mainly very large (always multiple GiB) blobs then it may make sense to change the
  cluster size to 1GiB, for example.
* **Number of Metadata Pages**: By default, Blobstore assumes there can be as many clusters as there are metadata pages,
  which is the worst case scenario in terms of metadata usage. This can be overridden here; however, the gain in space
  efficiency is not significant.
* **Maximum Simultaneous Metadata Operations**: Determines how many internally pre-allocated memory structures are set
  aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
* **Maximum Simultaneous Operations Per Channel**: Determines how many internally pre-allocated memory structures are set
  aside for channel operations. Changes to this value would be application dependent and best determined by a knowledge
  of the typical usage model, an understanding of the types of SSDs being used, and empirical data. The default is 512.
* **Blobstore Type**: This field is a character array to be used by applications that need to identify whether the
  Blobstore found here is appropriate to claim or not. The default is NULL and unless the application is being deployed in
  an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore, there
  is no need to set this value. It can, however, be set to any valid set of characters.
* **External Snapshot Device Creation Callback**: If the blobstore supports external snapshots, this function will be called
  when a blob that clones an external snapshot (an "esnap clone") is opened so that the blobstore consumer can load the external
  snapshot and register a blobstore device that will satisfy read requests. See @ref blob_pg_esnap_and_esnap_clone.
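
As a rough illustration of the constraints above, the sketch below mirrors three of the options in a hypothetical `toy_bs_opts` structure and checks that the cluster size is a multiple of the page size. The structure and function names are invented for this example and are not the SPDK options structure; the defaults are the ones quoted in this section, assuming the cluster size default is a binary megabyte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of some options described above; not the SPDK struct. */
struct toy_bs_opts {
    uint32_t cluster_size;    /* bytes */
    uint32_t max_md_ops;      /* simultaneous metadata operations */
    uint32_t max_channel_ops; /* simultaneous operations per channel */
};

/* Fill in the defaults quoted in the list above. */
static void toy_bs_opts_init(struct toy_bs_opts *opts)
{
    opts->cluster_size = 1024 * 1024; /* assumed to mean 1 MiB */
    opts->max_md_ops = 32;
    opts->max_channel_ops = 512;
}

/* The cluster size must be a non-zero multiple of the page size. */
static bool toy_bs_opts_valid(const struct toy_bs_opts *opts, uint32_t page_size)
{
    return page_size != 0 && opts->cluster_size != 0 &&
           opts->cluster_size % page_size == 0;
}
```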

### Sub-page Sized Operations

Blobstore is only capable of doing page sized read/write operations. If the application
requires finer granularity it will have to accommodate that itself.
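
One way an application might accommodate finer granularity is to widen a byte range to the whole pages that cover it, read those pages, modify the bytes of interest in memory, and write the pages back. The helper below is a minimal sketch of the first step only; `page_span` and `bytes_to_pages` are hypothetical names, not Blobstore API.

```c
#include <assert.h>
#include <stdint.h>

/* Whole-page range that covers an arbitrary byte range. */
struct page_span {
    uint64_t first_page; /* first page touched by the byte range */
    uint64_t num_pages;  /* number of whole pages to read or write */
};

/* Widen [offset, offset + length) to page granularity. */
static struct page_span bytes_to_pages(uint64_t offset, uint64_t length,
                                       uint32_t page_size)
{
    struct page_span span;
    uint64_t last_byte;

    assert(page_size > 0 && length > 0);
    last_byte = offset + length - 1;
    span.first_page = offset / page_size;
    span.num_pages = last_byte / page_size - span.first_page + 1;
    return span;
}
```

For example, a 10 byte update starting at byte 4090 with 4KiB pages straddles a page boundary, so the application would need to read and rewrite two pages.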

### Threads

As mentioned earlier, Blobstore can share a single thread with an application or the application
can define any number of threads, within resource constraints, that makes sense. The basic considerations that must be
followed are:

* Metadata operations (API with MD in the name) should be isolated from each other as there is no internal locking on the
  memory structures affected by these API.
* Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO would be one that is
  reading/writing to an area of a blob that a metadata operation is deallocating).
* Asynchronous callbacks will always take place on the calling thread.
* No assumptions about IO ordering can be made regardless of how many or which threads were involved in the issuing.

### Data Buffer Memory

As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated
with SPDK API.

### Error Handling

Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values
indicate an error. Synchronous calls will typically return an error value if applicable.

### Asynchronous API

Asynchronous calls do not necessarily return control immediately, but at the point in execution where no
more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of
an asynchronous call until the callback has completed.

### Xattrs

Setting and removing of xattrs in Blobstore is a metadata operation; xattrs are stored in per blob metadata.
Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a separate step for
persisting per blob metadata allows applications to perform batches of xattr updates, for example, with only one
more expensive call to synchronize and persist the values.

### Synchronizing Metadata

As described earlier, there are two types of metadata in Blobstore, per blob and one global
metadata for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global
metadata will be inconsistent during run-time and only synchronized on proper shutdown. The implication, however, of
an improper shutdown is only a performance penalty on the next startup as the global metadata will need to be rebuilt
based on a parsing of the per blob metadata. For consistent start times, it is important to always close down the Blobstore
properly via API.

### Iterating Blobs

Multiple examples of how to iterate through the blobs are included in the sample code and tools.
Note, however, that when walking through the existing blobs via the iter API, if your application finds the blob it's
looking for, it will either need to explicitly close it (because it was opened internally by the Blobstore) or complete
walking the full list.

### The Super Blob

The super blob is simply a single blob ID that can be stored as part of the global metadata to act
as a sort of "root" blob. The application may choose to use this blob to store any information that it needs or finds
relevant in understanding any kind of structure for what is on the Blobstore.

## Examples {#blob_pg_examples}

There are multiple examples of Blobstore usage in the [repo](https://github.com/spdk/spdk):

* **Hello World**: Actually named `hello_blob.c`, this is a very basic example of a single threaded application that
  does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses
  a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end
  is a `bdev` module, thus this example uses not only the SPDK Framework but the `bdev` layer as well.

* **CLI**: The `blobcli.c` example is a command line utility intended to serve not only as example code but also as a test
  and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the
  SPDK Framework and the `bdev` layer but offers multiple modes of operation to accomplish some real-world tasks. In
  command mode, it accepts single-shot commands, which can be a little time consuming if there are many commands to
  get through as each one will take a few seconds waiting for DPDK initialization. It therefore has a shell mode that
  allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore with simple commands
  that include the ability to import/export blobs from/to regular files. Lastly, there is a scripting mode to automate
  a series of tasks, again handy for development and/or test type activities.

## Configuration {#blob_pg_config}

Blobstore configuration options are described in the initialization options section under @ref blob_pg_design.

## Component Detail {#blob_pg_component}

The information in this section is not necessarily relevant to designing an application for use with Blobstore, but
understanding a little more about the internals may be interesting and is also included here for those wanting to
contribute to the Blobstore effort itself.

### Media Format

The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such
that cluster 0 begins at the first logical block.

```text
LBA 0                                   LBA N
+-----------+-----------+-----+-----------+
| Cluster 0 | Cluster 1 | ... | Cluster N |
+-----------+-----------+-----+-----------+
```

Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:

```text
+--------+-------------------+
| Page 0 | Page 1 ... Page N |
+--------+-------------------+
| Super  |  Metadata Region  |
| Block  |                   |
+--------+-------------------+
```

The super block is a single page located at the beginning of the partition. It contains basic information about
the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer
to the latest source code for complete structural details of the super block and metadata region.

Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages
form a linked list. The first page in the list will be written in place on update, while all other pages will
be written to fresh locations. This requires the backing device to support an atomic write size greater than
or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.

### Blob cluster layout {#blob_pg_cluster_layout}

Each blob is an ordered list of clusters, where the starting LBA of a cluster is called an extent. A blob can be
thin provisioned, resulting in no extent for some of the clusters. When the first write operation occurs
to an unallocated cluster, a new extent is chosen. This information is stored in RAM and on-disk.

There are two extent representations on-disk, depending on the `use_extent_table` (default: true) option used
when creating a blob.

* **use_extent_table=true**: The EXTENT_PAGE descriptor is not part of the linked list of pages. It contains extents
  that are not run-length encoded. Each extent page is referenced by the EXTENT_TABLE descriptor, which is serialized
  as part of the linked list of pages. The extent table run-length encodes all unallocated extent pages.
  Every new cluster allocation updates a single extent page if that extent page was previously allocated.
  Otherwise, it additionally incurs serializing the whole linked list of pages for the blob.

* **use_extent_table=false**: The EXTENT_RLE descriptor is serialized as part of the linked list of pages.
  Extents pointing to contiguous LBAs are run-length encoded, including unallocated extents represented by 0.
  Every new cluster allocation incurs serializing the whole linked list of pages for the blob.
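
To make the run-length encoding idea concrete, the sketch below collapses a blob's ordered cluster list into runs, treating 0 as "unallocated" as described above. It is a simplified model that works on cluster numbers rather than LBAs; `extent_run` and `rle_encode` are hypothetical names and do not reflect the actual on-disk descriptor format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run: `length` consecutive clusters starting at `start` (0 = unallocated). */
struct extent_run {
    uint64_t start;
    uint32_t length;
};

/* Encode `n` cluster entries into runs; returns the number of runs written. */
static size_t rle_encode(const uint64_t *clusters, size_t n,
                         struct extent_run *runs, size_t max_runs)
{
    size_t num_runs = 0;

    for (size_t i = 0; i < n; i++) {
        if (num_runs > 0) {
            struct extent_run *last = &runs[num_runs - 1];
            /* Adjacent unallocated entries merge into one run of zeroes. */
            int both_unallocated = (last->start == 0 && clusters[i] == 0);
            /* Allocated entries merge only when physically contiguous. */
            int contiguous = (last->start != 0 &&
                              clusters[i] == last->start + last->length);

            if (both_unallocated || contiguous) {
                last->length++;
                continue;
            }
        }
        assert(num_runs < max_runs);
        runs[num_runs].start = clusters[i];
        runs[num_runs].length = 1;
        num_runs++;
    }
    return num_runs;
}
```

A blob whose clusters are `5, 6, 7, 0, 0, 9` encodes to three runs: three clusters starting at 5, two unallocated clusters, and one cluster at 9.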

### Thin Blobs, Snapshots, and Clones

Each in-use cluster is allocated to blobstore metadata or to a particular blob. Once a cluster is
allocated to a blob it is considered owned by that blob and that particular blob's metadata
maintains a reference to the cluster as a record of ownership. Cluster ownership is transferred
during snapshot operations described later in @ref blob_pg_snapshots.

Through the use of thin provisioning, snapshots, and/or clones, a blob may be backed by clusters it
owns, clusters owned by another blob, or by a zeroes device. The behavior of reads and writes depends
on whether the operation targets blocks that are backed by a cluster owned by the blob or not.

* **read from blocks on an owned cluster**: The read is serviced by reading directly from the
  appropriate cluster.
* **read from other blocks**: The read is passed on to the blob's *back device* and the back
  device services the read. The back device may be another blob or it may be a zeroes device.
* **write to blocks on an owned cluster**: The write is serviced by writing directly to the
  appropriate cluster.
* **write to thin provisioned cluster**: If the back device is the zeroes device and no cluster
  is allocated to the blob, the process described in @ref blob_pg_thin_provisioning is followed.
* **write to other blocks**: A copy-on-write operation is triggered. See @ref blob_pg_copy_on_write
  for details.

External snapshots allow some external data source to act as a snapshot. This allows clones to be
created of data that resides outside of the blobstore containing the clone.

#### Thin Provisioning {#blob_pg_thin_provisioning}

As mentioned in @ref blob_pg_cluster_layout, a blob may be thin provisioned. A thin provisioned blob
starts out with no allocated clusters. Clusters are allocated as writes occur. A thin provisioned
blob's back device is a *zeroes device*. A read from a zeroes device fills the read buffer with
zeroes.

When a write to a thin provisioned blob targets a block that does not have an allocated cluster, the
following steps are performed:

1. Allocate a cluster.
2. Update blob metadata.
3. Perform the write.
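
The three steps above can be sketched with a small in-memory model. This is an illustration under simplifying assumptions (a fixed number of tiny clusters, byte-sized writes, a trivial allocator); `thin_blob`, `thin_write`, and `thin_read` are hypothetical names and not Blobstore API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CLUSTERS 4   /* clusters in the toy blob */
#define CLUSTER_BYTES 16 /* tiny cluster size, for illustration only */
#define UNALLOCATED 0    /* cluster map entry meaning "no cluster yet" */

struct thin_blob {
    uint32_t cluster_map[NUM_CLUSTERS];             /* blob metadata */
    uint8_t store[NUM_CLUSTERS + 1][CLUSTER_BYTES]; /* backing clusters; index 0 unused */
    uint32_t next_free;                             /* next free backing cluster */
};

static void thin_blob_init(struct thin_blob *b)
{
    memset(b, 0, sizeof(*b));
    b->next_free = 1;
}

/* Write one byte, allocating the cluster on first touch. */
static void thin_write(struct thin_blob *b, uint32_t cluster, uint32_t off, uint8_t val)
{
    assert(cluster < NUM_CLUSTERS && off < CLUSTER_BYTES);
    if (b->cluster_map[cluster] == UNALLOCATED) {
        uint32_t c = b->next_free++;  /* 1. allocate a cluster */
        b->cluster_map[cluster] = c;  /* 2. update blob metadata */
    }
    b->store[b->cluster_map[cluster]][off] = val; /* 3. perform the write */
}

/* Read one byte; unallocated clusters behave like the zeroes device. */
static uint8_t thin_read(const struct thin_blob *b, uint32_t cluster, uint32_t off)
{
    assert(cluster < NUM_CLUSTERS && off < CLUSTER_BYTES);
    if (b->cluster_map[cluster] == UNALLOCATED) {
        return 0;
    }
    return b->store[b->cluster_map[cluster]][off];
}
```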

#### Snapshots and Clones {#blob_pg_snapshots}

A snapshot is a read-only blob that may have clones. A snapshot may itself be a clone of one other
blob. While the interface gives the illusion of being able to create many snapshots of a blob, under
the covers this results in a chain of snapshots that are clones of the previous snapshot.

When blob1 is snapshotted, a new read-only blob is created and blob1 becomes a clone of this new
blob. That is:

| Step | Action                         | State                                    |
| ---- | ------------------------------ | ---------------------------------------- |
| 1    | Create blob1                   | `blob1 (rw)`                             |
| 2    | Create snapshot blob2 of blob1 | `blob1 (rw) --> blob2 (ro)`              |
| 2a   | Write to blob1                 | `blob1 (rw) --> blob2 (ro)`              |
| 3    | Create snapshot blob3 of blob1 | `blob1 (rw) --> blob3 (ro) --> blob2 (ro)` |

Supposing blob1 was not thin provisioned, step 1 would have allocated clusters needed to perform a
full write of blob1. As blob2 is created in step 2, the ownership of all of blob1's clusters is
transferred to blob2 and blob2 becomes blob1's back device. During step 2a, the writes to blob1 cause
one or more clusters to be allocated to blob1. When blob3 is created in step 3, the clusters
allocated in step 2a are given to blob3, blob3's back device becomes blob2, and blob1's back device
becomes blob3.
37445e0a2a3SMike Gerdts
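The ownership transfers described above can be sketched with a small in-memory model. This is
only an illustration: `struct toy_blob` and the `toy_*` functions are hypothetical names invented
for this sketch and are not part of the Blobstore API or its internal structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CLUSTERS 4

/* Hypothetical stand-in for a blob; NOT the real struct spdk_blob. */
struct toy_blob {
	const char *name;
	bool read_only;
	bool owns[TOY_CLUSTERS];	/* true if this blob owns cluster i */
	struct toy_blob *back;		/* back device, NULL stands in for "none" */
};

/* Step 1: create a thick provisioned blob, so every cluster is allocated. */
void toy_create(struct toy_blob *b, const char *name)
{
	b->name = name;
	b->read_only = false;
	b->back = NULL;
	for (size_t i = 0; i < TOY_CLUSTERS; i++) {
		b->owns[i] = true;
	}
}

/*
 * Steps 2 and 3: the snapshot takes over the blob's clusters and is
 * inserted between the blob and the blob's previous back device.
 */
void toy_snapshot(struct toy_blob *blob, struct toy_blob *snap, const char *name)
{
	snap->name = name;
	snap->read_only = true;
	snap->back = blob->back;
	for (size_t i = 0; i < TOY_CLUSTERS; i++) {
		snap->owns[i] = blob->owns[i];
		blob->owns[i] = false;
	}
	blob->back = snap;
}

/* Step 2a: a write allocates a cluster to the writable blob (data copy elided). */
void toy_write(struct toy_blob *blob, size_t cluster)
{
	assert(!blob->read_only && cluster < TOY_CLUSTERS);
	blob->owns[cluster] = true;
}

/* Follow back devices until some blob owns the cluster; NULL if none does. */
const struct toy_blob *toy_owner(const struct toy_blob *blob, size_t cluster)
{
	while (blob != NULL && !blob->owns[cluster]) {
		blob = blob->back;
	}
	return blob;
}
```

Running steps 1 through 3 against this model leaves the cluster written in step 2a owned by
blob3, while the untouched clusters remain owned by blob2, two hops away from blob1.
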
It is important to understand the chain above when considering strategies to use a golden image from
which many clones are made. The IO path is more efficient if one snapshot is cloned many times than
it is to create a new snapshot for every clone. The following illustrates the difference.

Using a single snapshot means the data originally referenced by the golden image is always one hop
away.

```text
create golden                           golden --> golden-snap
snapshot golden as golden-snap                     ^ ^ ^
clone golden-snap as clone1              clone1 ---+ | |
clone golden-snap as clone2              clone2 -----+ |
clone golden-snap as clone3              clone3 -------+
```

Using a snapshot per clone means that the chain of back devices grows with every new snapshot and
clone pair. Reading a block from clone3 may result in a read from clone3's back device (snap3), from
clone2's back device (snap2), then finally clone1's back device (snap1, the current owner of the
blocks originally allocated to golden).

```text
create golden
snapshot golden as snap1                golden --> snap3 -----> snap2 ----> snap1
clone snap1 as clone1                   clone3----/   clone2 --/  clone1 --/
snapshot golden as snap2
clone snap2 as clone2
snapshot golden as snap3
clone snap3 as clone3
```

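The difference between the two strategies can be made concrete by counting back-device hops in a
toy model. The names below (`struct toy_node`, `toy_clone`, `toy_snapshot`, `toy_hops`) are
hypothetical and exist only for this sketch; they do not correspond to Blobstore functions.

```c
#include <stddef.h>

/* Hypothetical node in a chain of back devices. */
struct toy_node {
	struct toy_node *back;
};

/* A clone is a new blob whose back device is the snapshot. */
void toy_clone(struct toy_node *clone, struct toy_node *snap)
{
	clone->back = snap;
}

/* A snapshot is inserted between the blob and its previous back device. */
void toy_snapshot(struct toy_node *blob, struct toy_node *snap)
{
	snap->back = blob->back;
	blob->back = snap;
}

/* Number of back-device hops from blob to target, or -1 if unreachable. */
int toy_hops(const struct toy_node *blob, const struct toy_node *target)
{
	int hops = 0;

	while (blob != NULL) {
		if (blob == target) {
			return hops;
		}
		blob = blob->back;
		hops++;
	}
	return -1;
}
```

With a single snapshot, every clone reaches the golden data in one hop. With a snapshot per
clone, replaying the second sequence above puts clone3 three hops away from snap1.
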
A snapshot with no more than one clone can be deleted. When a snapshot with one clone is deleted,
the clone becomes a regular blob. The clusters owned by the snapshot are transferred to the clone or
freed, depending on whether the clone already owns a cluster for a particular block range.

Removal of the last clone leaves the snapshot in place. This snapshot continues to be read-only and
can serve as the snapshot for future clones.

#### Inflating and Decoupling Clones

A clone can remove its dependence on a snapshot with the following operations:

1. Inflate the clone. Clusters backed by any snapshot or a zeroes device are copied into newly
   allocated clusters. The blob becomes a thick provisioned blob.
2. Decouple the clone. Clusters backed by the first back device snapshot are copied into newly
   allocated clusters. If the clone's back device snapshot was itself a clone of another
   snapshot, the clone remains a clone but is now a clone of a different snapshot.
3. Remove the snapshot. This is only possible if the snapshot has one clone. The end result is
   usually the same as decoupling but ownership of clusters is transferred from the snapshot rather
   than being copied. If the snapshot that was deleted was itself a clone of another snapshot, the
   clone remains a clone, but is now a clone of a different snapshot.

#### External Snapshots and Esnap Clones {#blob_pg_esnap_and_esnap_clone}

A blobstore that is loaded with the `esnap_bs_dev_create` callback defined will support external
snapshots (esnaps). An external snapshot is not useful on its own: it needs to be cloned by a blob.
A clone of an external snapshot is referred to as an *esnap clone*. An esnap clone supports IO and
other operations just like any other clone.

An esnap clone can be recognized in various ways:

* **On disk**: the blob metadata has the `SPDK_BLOB_EXTERNAL_SNAPSHOT` (0x8) bit set in
  `invalid_flags` and an internal XATTR with name `BLOB_EXTERNAL_SNAPSHOT_ID` ("EXTSNAP") exists.
* **In memory**: the `spdk_blob` structure contains the metadata read from disk, `blob->parent_id`
  is set to `SPDK_BLOBID_EXTERNAL_SNAPSHOT`, and `blob->back_bs_dev` references a blobstore device
  which is neither a blob in the same blobstore nor a zeroes device.

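The on-disk test above amounts to a flag check plus an XATTR lookup. The sketch below models that
logic in isolation; the `TOY_*` constants mirror the values quoted above, and
`toy_is_esnap_clone` along with its way of passing XATTR names as an array of strings are
assumptions made for illustration, not the Blobstore implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values taken from the text above; the TOY_ names are hypothetical. */
#define TOY_BLOB_EXTERNAL_SNAPSHOT	0x8ULL
#define TOY_EXTERNAL_SNAPSHOT_XATTR	"EXTSNAP"

/*
 * Returns true when both on-disk indicators are present: the flag bit in
 * invalid_flags and an internal XATTR with the expected name.
 */
bool toy_is_esnap_clone(uint64_t invalid_flags, const char *const *xattr_names,
			size_t count)
{
	if ((invalid_flags & TOY_BLOB_EXTERNAL_SNAPSHOT) == 0) {
		return false;
	}
	for (size_t i = 0; i < count; i++) {
		if (strcmp(xattr_names[i], TOY_EXTERNAL_SNAPSHOT_XATTR) == 0) {
			return true;
		}
	}
	return false;
}
```
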
#### Shallow Copy {#blob_shallow_copy}

A read-only blob can be copied onto a blobstore device in such a way that only the clusters
allocated to the blob are written to the device. The device must be at least as large as the blob,
and the blobstore's block size must be an integer multiple of the device's block size.
This functionality can be used to recreate the entire snapshot stack of a blob in a different
blobstore.

#### Change the parent of a blob {#blob_reparent}

We can change the parent of a thin provisioned blob, making the blob a clone of a snapshot of the
same blobstore or a clone of an external snapshot. The previous parent of the blob can be a snapshot,
an external snapshot or none.

If the new parent of the blob is a snapshot of the same blobstore, the blob and the snapshot must
have the same number of clusters.

If the new parent of the blob is an external snapshot, the size of the esnap must be an integer
multiple of the blob's cluster size.

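The two size constraints can be expressed as simple predicates. The function names and the choice
of byte-based parameters are assumptions made for this sketch; they only restate the rules above
and are not the checks as implemented in Blobstore.

```c
#include <stdbool.h>
#include <stdint.h>

/* New parent is a snapshot in the same blobstore: cluster counts must match. */
bool toy_can_reparent_to_snapshot(uint64_t blob_clusters, uint64_t snapshot_clusters)
{
	return blob_clusters == snapshot_clusters;
}

/* New parent is an external snapshot: its size must be a multiple of the cluster size. */
bool toy_can_reparent_to_esnap(uint64_t esnap_size_bytes, uint64_t cluster_size_bytes)
{
	return cluster_size_bytes != 0 && esnap_size_bytes % cluster_size_bytes == 0;
}
```

For example, with a 1 MiB cluster size an 8 MiB esnap satisfies the second rule, while an esnap
of 1 MiB plus 512 bytes does not.
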
#### Copy-on-write {#blob_pg_copy_on_write}

A copy-on-write operation is somewhat expensive, with the cost being proportional to the cluster
size. Typical copy-on-write involves the following steps:

1. Allocate a cluster.
2. Allocate a cluster-sized buffer into which data can be read.
3. Trigger a full-cluster read from the back device into the cluster-sized buffer.
4. Write from the cluster-sized buffer into the newly allocated cluster.
5. Update the blob's on-disk metadata to record ownership of the newly allocated cluster. This
   involves at least one page-sized write.
6. Write the new data to the just allocated and copied cluster.

If the source cluster is backed by a zeroes device, steps 2 through 4 are skipped. Alternatively, if
the blobstore resides on a device that can perform the copy on its own, steps 2 through 4 are
offloaded to the device. Neither of these optimizations is available when the back device is an
external snapshot.

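The six steps can be traced in a minimal in-memory sketch. Everything here is hypothetical: a
single 8-byte "cluster", a `NULL` back pointer standing in for a zeroes device, and an
`allocated` field standing in for the on-disk metadata update. The real code path is
asynchronous and considerably more involved.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CLUSTER_SIZE 8u

/* Hypothetical blob with room for exactly one cluster. */
struct toy_cow_blob {
	uint8_t cluster[TOY_CLUSTER_SIZE];
	int allocated;	/* stands in for the on-disk ownership record */
};

/* Write len bytes at offset, copying from the back device first if needed. */
int toy_cow_write(struct toy_cow_blob *blob, const uint8_t *back,
		  size_t offset, const uint8_t *data, size_t len)
{
	if (offset > TOY_CLUSTER_SIZE || len > TOY_CLUSTER_SIZE - offset) {
		return -1;
	}
	if (!blob->allocated) {
		/* Step 1: allocate a cluster (modeled as zero-filling the array). */
		memset(blob->cluster, 0, TOY_CLUSTER_SIZE);
		if (back != NULL) {
			/* Step 2: allocate a cluster-sized buffer. */
			uint8_t *buf = malloc(TOY_CLUSTER_SIZE);
			if (buf == NULL) {
				return -1;
			}
			/* Step 3: full-cluster read from the back device. */
			memcpy(buf, back, TOY_CLUSTER_SIZE);
			/* Step 4: write the buffer into the new cluster. */
			memcpy(blob->cluster, buf, TOY_CLUSTER_SIZE);
			free(buf);
		}
		/* Step 5: record ownership of the new cluster. */
		blob->allocated = 1;
	}
	/* Step 6: write the new data into the cluster. */
	memcpy(blob->cluster + offset, data, len);
	return 0;
}
```

Note how a two-byte write still moves a full cluster of data in steps 3 and 4, which is why the
cost scales with the cluster size, and how the zeroes case skips steps 2 through 4 entirely.
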
### Sequences and Batches

Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either
a serial fashion or in parallel, respectively. Both are defined using the following structure:

~~~{.sh}
struct spdk_bs_request_set;
~~~

These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups
of IO. They are an internal construct only and are pre-allocated on a per-channel basis (channels were discussed
earlier). They are removed from a linked list associated with the channel when the set (sequence or batch) is
started and then returned to the list when completed.

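The take-on-start, return-on-completion pattern can be sketched as a fixed pool threaded onto a
free list. The types and functions below (`struct toy_request_set`, `struct toy_channel`,
`toy_set_start`, `toy_set_complete`) and the pool size are hypothetical; the actual request set
and channel structures in Blobstore carry much more state.

```c
#include <stddef.h>

#define TOY_POOL_SIZE 4

/* Hypothetical request set: only the list linkage and a busy flag. */
struct toy_request_set {
	struct toy_request_set *next;
	int in_use;
};

/* Hypothetical channel owning a pre-allocated pool of request sets. */
struct toy_channel {
	struct toy_request_set sets[TOY_POOL_SIZE];
	struct toy_request_set *free_list;
};

/* Pre-allocate: put every request set on the channel's free list. */
void toy_channel_init(struct toy_channel *ch)
{
	ch->free_list = NULL;
	for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
		ch->sets[i].in_use = 0;
		ch->sets[i].next = ch->free_list;
		ch->free_list = &ch->sets[i];
	}
}

/* Starting a sequence or batch removes a set from the list; NULL if exhausted. */
struct toy_request_set *toy_set_start(struct toy_channel *ch)
{
	struct toy_request_set *set = ch->free_list;

	if (set == NULL) {
		return NULL;
	}
	ch->free_list = set->next;
	set->next = NULL;
	set->in_use = 1;
	return set;
}

/* Completion returns the set to the list so it can be reused. */
void toy_set_complete(struct toy_channel *ch, struct toy_request_set *set)
{
	set->in_use = 0;
	set->next = ch->free_list;
	ch->free_list = set;
}
```

Because the pool is sized up front, starting a set never allocates memory in the IO path; when
the pool is empty the caller simply gets `NULL` back in this sketch.
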
Each request set maintains a reference to a `channel` and a `back_channel`. The `channel` is used
for performing IO on the blobstore device. The `back_channel` is used for performing IO on the
blob's back device, `blob->back_bs_dev`. For blobs that are not esnap clones, `channel` and
`back_channel` reference an IO channel used with the device that contains the blobstore. For blobs
that are esnap clones, `channel` is the same as with any other blob and `back_channel` is an IO
channel for the external snapshot device.

### Key Internal Structures

`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones
are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines
the public API is `blob.h`.

~~~{.sh}
struct spdk_blob
~~~
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two
copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written
to disk.

~~~{.sh}
struct spdk_blob_mut_data
~~~
This is a per-blob structure, included in `struct spdk_blob`, that actually defines the blob itself. It has the
specific information on size and makeup of the blob (i.e., how many clusters are allocated for this blob and which ones).

~~~{.sh}
struct spdk_blob_store
~~~
This is the main in-memory structure for the entire Blobstore. It defines the global on-disk metadata region and maintains
information relevant to the entire system, such as initialization options like the cluster size.

~~~{.sh}
struct spdk_bs_super_block
~~~
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore
structure just discussed along with other elements one would expect to see here such as signature, version, checksum, etc.

### Code Layout and Common Conventions

In general, `blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For
example,

~~~{.sh}
/* START spdk_bs_md_delete_blob */
< relevant functions to accomplish the deletion of a blob >
/* END spdk_bs_md_delete_blob */
~~~

And for the most part the following conventions are followed throughout:

* functions beginning with an underscore are called internally only
* functions or variables with the letters `cpl` are related to set or callback completions