1 # Blobstore Programmer's Guide {#blob}
2 
3 # In this document {#blob_pg_toc}
4 
5 * @ref blob_pg_audience
6 * @ref blob_pg_intro
7 * @ref blob_pg_theory
8 * @ref blob_pg_design
9 * @ref blob_pg_examples
10 * @ref blob_pg_config
11 * @ref blob_pg_component
12 
13 ## Target Audience {#blob_pg_audience}
14 
This programmer's guide is intended for developers authoring applications that use the SPDK Blobstore. It
supplements the source code by providing an overall understanding of how to integrate Blobstore into an
application, along with some high level insight into how Blobstore works behind the scenes. It is not intended
to serve as a design document or an API reference, although source code snippets and high level sequences are
discussed in some cases. For the latest source code, refer to the [repo](https://github.com/spdk).
20 
21 ## Introduction {#blob_pg_intro}
22 
Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system
backing a higher level storage service, typically in lieu of a traditional filesystem. These higher level services
can be local databases or key/value stores (MySQL, RocksDB), dedicated appliances (SAN, NAS), or distributed
storage systems (e.g. Ceph, Cassandra). It is not designed to be a general purpose filesystem, however, and it is
intentionally not POSIX compliant. To avoid confusion, this guide avoids references to files or objects and uses
the term 'blob' instead. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to
groups of blocks on a block device; these groups are called 'blobs'. Blobs are typically large, measured in at
least hundreds of kilobytes, and are always a multiple of the underlying block size.
31 
32 The Blobstore is designed primarily to run on "next generation" media, which means the device supports fast random
33 reads and writes, with no required background garbage collection. However, in practice the design will run well on
34 NAND too.
35 
36 ## Theory of Operation {#blob_pg_theory}
37 
38 ### Abstractions
39 
40 The Blobstore defines a hierarchy of storage abstractions as follows.
41 
* **Logical Block**: Logical blocks are exposed by the disk itself and are numbered from 0 to N-1, where N is the
  number of blocks in the disk. A logical block is typically either 512B or 4KiB.
* **Page**: A page is a fixed number of logical blocks defined at Blobstore creation time. The logical
  blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk such
  that the first page worth of blocks is page 0, the second page is page 1, etc. A page is typically 4KiB in size,
  so this is either 8 or 1 logical blocks in practice. The SSD must be able to perform atomic reads and writes of
  at least the page size.
* **Cluster**: A cluster is a fixed number of pages defined at Blobstore creation time. The pages that compose a cluster
  are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster
  worth of pages, cluster 1 is the second grouping of pages, etc. A cluster is typically 1MiB in size, or 256 pages.
* **Blob**: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application
  and persist across power failures and reboots. Applications use a Blobstore-provided identifier to access a particular blob.
  Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also
  store metadata with each blob in the form of key/value pairs, referred to as xattrs (extended attributes).
* **Blobstore**: An SSD which has been initialized by a Blobstore-based application is referred to as "a Blobstore." A
  Blobstore owns the entire underlying device, which is made up of a private Blobstore metadata region and the collection of
  blobs as managed by the application.
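
The arithmetic implied by this hierarchy can be sketched in a few lines of C. This is a minimal illustration, not
Blobstore code: the `toy_geometry` structure and the helper names are hypothetical, and the sizes used are only the
typical values quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry; real values are fixed when the Blobstore is created. */
struct toy_geometry {
    uint32_t block_size;    /* bytes per logical block, e.g. 512 or 4096 */
    uint32_t page_size;     /* bytes per page, a multiple of block_size */
    uint32_t cluster_size;  /* bytes per cluster, a multiple of page_size */
};

static uint32_t blocks_per_page(const struct toy_geometry *g)
{
    assert(g->page_size % g->block_size == 0);
    return g->page_size / g->block_size;
}

static uint32_t pages_per_cluster(const struct toy_geometry *g)
{
    assert(g->cluster_size % g->page_size == 0);
    return g->cluster_size / g->page_size;
}

/* First logical block of a device-wide page number. */
static uint64_t page_to_lba(const struct toy_geometry *g, uint64_t page)
{
    return page * blocks_per_page(g);
}

/*
 * Translate a page offset inside a blob into a device-wide page number.
 * clusters[] is the blob's ordered list of cluster numbers.
 */
static uint64_t blob_page_to_device_page(const struct toy_geometry *g,
                                         const uint64_t *clusters,
                                         uint64_t blob_page)
{
    uint32_t ppc = pages_per_cluster(g);
    uint64_t cluster = clusters[blob_page / ppc];

    return cluster * ppc + blob_page % ppc;
}
```

With 512B blocks, 4KiB pages and 1MiB clusters, a page spans 8 logical blocks and a cluster spans 256 pages, so page
257 of a blob falls at page 1 of the blob's second cluster.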
59 
@htmlonly

  <div id="blob_hierarchy"></div>

  <script>
    let elem = document.getElementById('blob_hierarchy');

    let canvasWidth = 800;
    let canvasHeight = 200;
    var two = new Two({ width: canvasWidth, height: canvasHeight }).appendTo(elem);

    // Outermost rectangle: the blob fills the whole canvas.
    var blobRect = two.makeRectangle(canvasWidth / 2, canvasHeight / 2, canvasWidth, canvasHeight);
    blobRect.fill = '#7ED3F7';

    var blobText = two.makeText('Blob', canvasWidth / 2, 10, { alignment: 'center'});

    // Two clusters side by side inside the blob.
    for (var i = 0; i < 2; i++) {
        let clusterWidth = 400;
        let clusterHeight = canvasHeight;
        var clusterRect = two.makeRectangle((clusterWidth / 2) + (i * clusterWidth),
                                            clusterHeight / 2,
                                            clusterWidth - 10,
                                            clusterHeight - 50);
        clusterRect.fill = '#00AEEF';

        var clusterText =  two.makeText('Cluster',
                                        (clusterWidth / 2) + (i * clusterWidth),
                                        35,
                                        { alignment: 'center', fill: 'white' });

        // Four pages inside each cluster.
        for (var j = 0; j < 4; j++) {
            let pageWidth = 100;
            let pageHeight = canvasHeight;
            var pageRect = two.makeRectangle((pageWidth / 2) + (j * pageWidth) + (i * clusterWidth),
                                             pageHeight / 2,
                                             pageWidth - 20,
                                             pageHeight - 100);
            pageRect.fill = '#003C71';

            var pageText =  two.makeText('Page',
                                         (pageWidth / 2) + (j * pageWidth) + (i * clusterWidth),
                                         pageHeight / 2,
                                         { alignment: 'center', fill: 'white' });
        }
    }

    two.update();
  </script>

@endhtmlonly
110 
111 ### Atomicity
112 
For all Blobstore operations regarding atomicity, there is a dependency on the underlying device to guarantee atomic
operations of at least one page size. Atomicity here can refer to multiple operations:

* **Data Writes**: For data writes, the unit of atomicity is one page. Therefore, if a write operation larger than
  one page is underway and the system suffers a power failure, the data on media will be consistent at page
  granularity: if a single page was in the middle of being updated when power was lost, then after power is restored
  the data at that page location will be as it was prior to the start of the write operation.
* **Blob Metadata Updates**: Each blob has its own set of metadata (xattrs, size, etc.). For performance reasons, a copy of
  this metadata is kept in RAM and only synchronized with the on-disk version when the application makes an explicit call to
  do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not consistent on disk until the
  call to synchronize it (covered later) completes; that synchronization is itself performed atomically.
* **Blobstore Metadata Updates**: Blobstore itself has its own metadata which, like per blob metadata, has a copy both in
  RAM and on disk. Unlike the per blob metadata, however, the Blobstore metadata region is not made consistent via a blob
  synchronization call; it is only synchronized when the Blobstore is properly unloaded via API. Therefore, if the Blobstore
  metadata is updated (blob creation, deletion, resize, etc.) and the Blobstore is not unloaded properly, it will need to
  perform some extra steps the next time it is loaded. This takes a bit more time than loading after a clean shutdown, but
  there will be no inconsistencies.
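
The data-write guarantee can be pictured with a small simulation. This is only a sketch under stated assumptions:
the "media" is a RAM buffer, `toy_write_with_power_loss` is a hypothetical helper, and a per-page `memcpy` stands in
for the device's atomic page write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/*
 * Write num_pages pages from src to media, but "lose power" after
 * pages_before_failure pages have been committed. Each page is copied as a
 * unit, standing in for the device's atomic page write.
 * Returns the number of pages that reached the media.
 */
static uint32_t toy_write_with_power_loss(uint8_t *media, const uint8_t *src,
                                          uint32_t num_pages,
                                          uint32_t pages_before_failure)
{
    uint32_t done = 0;

    assert(media != NULL && src != NULL);
    while (done < num_pages && done < pages_before_failure) {
        memcpy(media + (size_t)done * TOY_PAGE_SIZE,
               src + (size_t)done * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
        done++;
    }
    return done;
}

/* A page is untorn in this model if every byte carries the same marker. */
static int toy_page_is_uniform(const uint8_t *page)
{
    for (uint32_t i = 1; i < TOY_PAGE_SIZE; i++) {
        if (page[i] != page[0]) {
            return 0;
        }
    }
    return 1;
}
```

After an interrupted three-page write, each page holds either entirely old or entirely new data; the write as a
whole is not atomic, but no individual page is torn.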
130 
131 ### Callbacks
132 
Blobstore is callback driven. If a Blobstore API call is unable to make forward progress, it does not block;
instead it returns control at that point and, when the original call completes, invokes the callback function
supplied in the API call along with its arguments. The callback is made on the same thread the call was made from
(more on threads later). Some APIs, however, take no callback arguments; these calls are fully synchronous.
Asynchronous calls that use callbacks include those that involve disk IO, where some amount of polling is required
before the IO completes.
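
The general shape of this pattern can be sketched as follows. This is a toy model, not the Blobstore API: the
`toy_` names, the fixed request table and the simulated latency are all assumptions made for illustration. The
callback takes a context pointer and an error number, which is the shape described in this guide.

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback: caller context plus an error number (0 on success). */
typedef void (*toy_op_complete)(void *cb_arg, int bserrno);

struct toy_request {
    toy_op_complete cb_fn;
    void *cb_arg;
    int polls_remaining;   /* simulated device latency */
    int in_use;
};

#define TOY_MAX_REQUESTS 4
static struct toy_request g_requests[TOY_MAX_REQUESTS];

/* Start an operation; returns immediately without blocking. */
static int toy_submit(toy_op_complete cb_fn, void *cb_arg, int latency_polls)
{
    assert(cb_fn != NULL);
    for (int i = 0; i < TOY_MAX_REQUESTS; i++) {
        if (!g_requests[i].in_use) {
            g_requests[i] = (struct toy_request){ cb_fn, cb_arg, latency_polls, 1 };
            return 0;
        }
    }
    return -1;   /* no free request slot */
}

/* Called repeatedly by the owning thread; fires callbacks for finished work. */
static int toy_poll(void)
{
    int completed = 0;

    for (int i = 0; i < TOY_MAX_REQUESTS; i++) {
        struct toy_request *req = &g_requests[i];

        if (req->in_use && --req->polls_remaining <= 0) {
            req->in_use = 0;
            req->cb_fn(req->cb_arg, 0);
            completed++;
        }
    }
    return completed;
}

/* Example callback: record the outcome in caller-owned state. */
static void toy_record_done(void *cb_arg, int bserrno)
{
    int *done = cb_arg;

    *done = (bserrno == 0) ? 1 : -1;
}
```

Because the callback runs from `toy_poll()`, it executes on whichever thread does the polling, which mirrors the
rule that completions arrive on the thread that issued the call.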
139 
140 ### Backend Support
141 
Blobstore requires a backing storage device that can be integrated using the `bdev` layer, or by directly integrating a
device driver with Blobstore. The Blobstore performs operations on a backing block device by calling function pointers
supplied to it at initialization time. For convenience, an implementation of these function pointers that routes I/O
to the bdev layer is available in `blob_bdev.c`. Alternatively, for example, the SPDK NVMe driver may be directly integrated,
bypassing a small amount of `bdev` layer overhead. These options are discussed further in the upcoming section on examples.
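
A function-pointer backend of this kind can be sketched as below. The `toy_bs_dev` structure and the RAM-backed
implementation are hypothetical and deliberately synchronous; the real backing-device interface is asynchronous and
has more entry points, so treat this only as an illustration of the indirection.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical backing-device interface: callers only see these pointers. */
struct toy_bs_dev {
    uint64_t blockcnt;   /* number of logical blocks */
    uint32_t blocklen;   /* bytes per logical block */
    void *ctx;           /* backend-private state */
    int (*read)(struct toy_bs_dev *dev, void *buf, uint64_t lba, uint32_t lba_count);
    int (*write)(struct toy_bs_dev *dev, const void *buf, uint64_t lba, uint32_t lba_count);
};

static int ram_range_ok(const struct toy_bs_dev *dev, uint64_t lba, uint32_t lba_count)
{
    return lba <= dev->blockcnt && lba_count <= dev->blockcnt - lba;
}

static int ram_read(struct toy_bs_dev *dev, void *buf, uint64_t lba, uint32_t lba_count)
{
    if (!ram_range_ok(dev, lba, lba_count)) {
        return -1;
    }
    memcpy(buf, (uint8_t *)dev->ctx + lba * dev->blocklen, (size_t)lba_count * dev->blocklen);
    return 0;
}

static int ram_write(struct toy_bs_dev *dev, const void *buf, uint64_t lba, uint32_t lba_count)
{
    if (!ram_range_ok(dev, lba, lba_count)) {
        return -1;
    }
    memcpy((uint8_t *)dev->ctx + lba * dev->blocklen, buf, (size_t)lba_count * dev->blocklen);
    return 0;
}

/* Wire the RAM backend into the generic interface. */
static void toy_ram_dev_init(struct toy_bs_dev *dev, void *mem, uint64_t blockcnt, uint32_t blocklen)
{
    assert(dev != NULL && mem != NULL && blocklen > 0);
    dev->blockcnt = blockcnt;
    dev->blocklen = blocklen;
    dev->ctx = mem;
    dev->read = ram_read;
    dev->write = ram_write;
}
```

Swapping the backend then means supplying a different set of function pointers; the code that calls `dev->read` and
`dev->write` does not change.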
147 
148 ### Metadata Operations
149 
Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single
thread to avoid taking locks on the in-memory data structures that track the layout and definition of blobs (along
with other data). In Blobstore this is implemented as the metadata thread, defined as the thread on which the
application makes metadata related calls. It is up to the application to set up a separate thread for these calls
and to ensure that it does not mix relevant IO operations with metadata operations, even if they are on separate threads.
This is discussed further in the Design Considerations section.
156 
157 ### Threads
158 
An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios.
The simplest would be a single threaded application where the application, the Blobstore code and the NVMe driver share a
single core. In this case, the single thread would be used to submit both metadata operations and IO operations, and
it would be up to the application to ensure that only one metadata operation is issued at a time and that it is not
intermingled with affected IO operations.
164 
165 ### Channels
166 
167 Channels are an SPDK-wide abstraction and with Blobstore the best way to think about them is that they are
168 required in order to do IO.  The application will perform IO to the channel and channels are best thought of as being
169 associated 1:1 with a thread.
170 
171 ### Blob Identifiers
172 
When an application creates a blob, it does not provide a name, as is the case with many other similar
storage systems. Instead, the Blobstore returns a unique identifier that the application must use in subsequent API
calls to perform operations on that blob.
176 
177 ## Design Considerations {#blob_pg_design}
178 
179 ### Initialization Options
180 
181 When the Blobstore is initialized, there are multiple configuration options to consider. The
182 options and their defaults are:
183 
* **Cluster Size**: By default, this value is 1MiB. The cluster size is required to be a multiple of the page size and should be
  selected based on the application’s usage model in terms of allocation. Recall that blobs are made up of clusters, so when
  a blob is allocated/deallocated or changes in size, disk LBAs will be manipulated in groups of cluster size. If the
  application is expecting to deal mainly with very large (always multiple GiB) blobs, then it may make sense to change the
  cluster size to 1GiB, for example.
* **Number of Metadata Pages**: By default, Blobstore assumes there can be as many metadata pages as there are clusters,
  which is the worst case scenario in terms of metadata usage. This can be overridden here; however, the space saved is
  not significant.
* **Maximum Simultaneous Metadata Operations**: Determines how many internally pre-allocated memory structures are set
  aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
* **Maximum Simultaneous Operations Per Channel**: Determines how many internally pre-allocated memory structures are set
  aside for channel operations. Changes to this value would be application dependent and are best determined by a knowledge
  of the typical usage model, an understanding of the types of SSDs being used, and empirical data. The default is 512.
* **Blobstore Type**: This field is a character array to be used by applications that need to identify whether the
  Blobstore found here is appropriate to claim or not. The default is NULL and, unless the application is being deployed in
  an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore, there
  is no need to set this value. It can, however, be set to any valid set of characters.
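
The options and the constraint on cluster size can be sketched as a small structure with defaults and a validation
step. The `toy_bs_opts` structure, its field names and the 16-byte type field are assumptions for illustration; only
the default values are taken from the list above. Consult `blob.h` for the actual options structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical options structure mirroring the list above. */
struct toy_bs_opts {
    uint32_t cluster_sz;        /* bytes; must be a multiple of the page size */
    uint32_t num_md_pages;      /* 0 here means "use the worst-case default" */
    uint32_t max_md_ops;        /* pre-allocated metadata operations */
    uint32_t max_channel_ops;   /* pre-allocated operations per channel */
    char bstype[16];            /* empty string means "no type set" */
};

/* Fill in the defaults quoted in this guide. */
static void toy_bs_opts_init(struct toy_bs_opts *o)
{
    assert(o != NULL);
    memset(o, 0, sizeof(*o));
    o->cluster_sz = 1024 * 1024;
    o->max_md_ops = 32;
    o->max_channel_ops = 512;
}

/* Returns 0 if the options are self-consistent, -1 otherwise. */
static int toy_bs_opts_validate(const struct toy_bs_opts *o)
{
    if (o->cluster_sz == 0 || o->cluster_sz % TOY_PAGE_SIZE != 0) {
        return -1;
    }
    if (o->max_md_ops == 0 || o->max_channel_ops == 0) {
        return -1;
    }
    return 0;
}
```

An application that stores only multi-GiB blobs might start from the defaults and raise `cluster_sz` to 1GiB before
validating.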
201 
202 ### Sub-page Sized Operations
203 
204 Blobstore is only capable of doing page sized read/write operations. If the application
205 requires finer granularity it will have to accommodate that itself.
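
One common way for an application to accommodate sub-page writes is a read-modify-write of the enclosing page. The
sketch below assumes a RAM-backed "blob" and hypothetical `toy_read_page`/`toy_write_page` helpers standing in for
page-granular blob IO; a real application would issue asynchronous reads and writes and would need to serialize
concurrent updates to the same page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* Stand-ins for page-granular blob I/O against a RAM "blob". */
static void toy_read_page(const uint8_t *blob, uint64_t page, uint8_t *buf)
{
    memcpy(buf, blob + page * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
}

static void toy_write_page(uint8_t *blob, uint64_t page, const uint8_t *buf)
{
    memcpy(blob + page * TOY_PAGE_SIZE, buf, TOY_PAGE_SIZE);
}

/*
 * Write len bytes at byte offset `offset`. The range must fit inside one
 * page; returns -1 if it would cross a page boundary.
 */
static int toy_write_bytes(uint8_t *blob, uint64_t offset, const void *data, uint32_t len)
{
    uint8_t page_buf[TOY_PAGE_SIZE];
    uint64_t page = offset / TOY_PAGE_SIZE;
    uint32_t in_page = (uint32_t)(offset % TOY_PAGE_SIZE);

    assert(blob != NULL && data != NULL);
    if (len > TOY_PAGE_SIZE - in_page) {
        return -1;
    }
    toy_read_page(blob, page, page_buf);      /* read the whole page */
    memcpy(page_buf + in_page, data, len);    /* modify the bytes of interest */
    toy_write_page(blob, page, page_buf);     /* write the whole page back */
    return 0;
}
```

The cost of this approach is one extra page read per small write, which is part of why the application, rather than
Blobstore, is left to decide whether finer granularity is worth it.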
206 
207 ### Threads
208 
As mentioned earlier, Blobstore can share a single thread with an application, or the application
can define any number of threads, within resource constraints, that make sense. The basic considerations that must be
followed are:

* Metadata operations (APIs with MD in the name) should be isolated from each other, as there is no internal locking on the
  memory structures affected by these APIs.
* Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO would be one that is
  reading/writing to an area of a blob that a metadata operation is deallocating).
* Asynchronous callbacks will always take place on the calling thread.
* No assumptions about IO ordering can be made regardless of how many or which threads were involved in the issuing.
219 
220 ### Data Buffer Memory
221 
222 As with all SPDK based applications, Blobstore requires memory used for data buffers to be allocated
223 with SPDK API.
224 
225 ### Error Handling
226 
Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values
indicate an error. Synchronous calls will typically return an error value if applicable.
229 
230 ### Asynchronous API
231 
Asynchronous calls do not return control immediately, but at the point in execution where no
more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of
an asynchronous call until its callback has been invoked.
235 
236 ### Xattrs
237 
Setting and removing xattrs in Blobstore are metadata operations; xattrs are stored in per blob metadata.
Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a two-step process for
persisting per blob metadata allows applications to perform batches of xattr updates, for example, with only one
more expensive call to synchronize and persist the values.
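
The batching benefit can be modelled with two copies of the metadata, one being edited in RAM and one representing
what was last persisted. Everything here is a toy: the `toy_` names, the fixed-size tables and the synchronous
"sync" are assumptions made for illustration, not the Blobstore API.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_XATTRS 8
#define TOY_NAME_LEN   16
#define TOY_VALUE_LEN  32

struct toy_xattr {
    char name[TOY_NAME_LEN];
    char value[TOY_VALUE_LEN];
};

struct toy_md {
    struct toy_xattr xattrs[TOY_MAX_XATTRS];
    int count;
};

struct toy_blob {
    struct toy_md active;   /* current metadata, in RAM only */
    struct toy_md clean;    /* last copy "written to disk" */
    int sync_count;         /* number of expensive sync operations */
};

/* Cheap: only touches the in-RAM copy. */
static int toy_set_xattr(struct toy_blob *b, const char *name, const char *value)
{
    assert(b != NULL && name != NULL && value != NULL);
    if (strlen(name) >= TOY_NAME_LEN || strlen(value) >= TOY_VALUE_LEN) {
        return -1;
    }
    for (int i = 0; i < b->active.count; i++) {
        if (strcmp(b->active.xattrs[i].name, name) == 0) {
            strcpy(b->active.xattrs[i].value, value);
            return 0;
        }
    }
    if (b->active.count == TOY_MAX_XATTRS) {
        return -1;
    }
    strcpy(b->active.xattrs[b->active.count].name, name);
    strcpy(b->active.xattrs[b->active.count].value, value);
    b->active.count++;
    return 0;
}

/* Expensive: persists every pending change in one step. */
static void toy_sync_md(struct toy_blob *b)
{
    b->clean = b->active;
    b->sync_count++;
}

/* Look up what would survive a power failure right now. */
static const char *toy_get_persisted(const struct toy_blob *b, const char *name)
{
    for (int i = 0; i < b->clean.count; i++) {
        if (strcmp(b->clean.xattrs[i].name, name) == 0) {
            return b->clean.xattrs[i].value;
        }
    }
    return NULL;
}
```

Setting two xattrs and then calling `toy_sync_md()` once persists both values with a single sync; before the sync,
neither is visible in the persisted copy.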
242 
243 ### Synchronizing Metadata
244 
As described earlier, there are two types of metadata in Blobstore: per blob metadata and one global
metadata for the Blobstore itself. Only the per blob metadata can be explicitly synchronized via API. The global
metadata will be inconsistent during run-time and is only synchronized on proper shutdown. The implication of
an improper shutdown, however, is only a performance penalty on the next startup, as the global metadata will need to be
rebuilt by parsing the per blob metadata. For consistent start times, it is important to always close down the Blobstore
properly via API.
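
The kind of rebuild described here can be sketched as recomputing a used-cluster map from each blob's cluster list.
This is a simplified model with hypothetical names: it assumes a tiny fixed device, that cluster 0 is reserved for
metadata, and that no two blobs share a cluster.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NUM_CLUSTERS 16

/* Per blob metadata reduced to the one thing the rebuild needs. */
struct toy_blob_md {
    const uint32_t *clusters;   /* cluster numbers owned by this blob */
    uint32_t num_clusters;
};

/*
 * Rebuild the global used-cluster map by walking every blob's metadata.
 * Returns 0 on success, or -1 if a cluster is out of range or claimed twice.
 */
static int toy_rebuild_used_clusters(const struct toy_blob_md *blobs, uint32_t num_blobs,
                                     uint8_t used[TOY_NUM_CLUSTERS])
{
    assert(blobs != NULL && used != NULL);
    memset(used, 0, TOY_NUM_CLUSTERS);
    used[0] = 1;   /* cluster 0 holds the super block and metadata region */

    for (uint32_t i = 0; i < num_blobs; i++) {
        for (uint32_t j = 0; j < blobs[i].num_clusters; j++) {
            uint32_t c = blobs[i].clusters[j];

            if (c >= TOY_NUM_CLUSTERS || used[c]) {
                return -1;
            }
            used[c] = 1;
        }
    }
    return 0;
}
```

The walk touches every blob's metadata, which is why startup after an unclean shutdown is slower than loading the
already-consistent global metadata.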
251 
252 ### Iterating Blobs
253 
Multiple examples of how to iterate through the blobs are included in the sample code and tools.
Note, however, that when walking through the existing blobs via the iter API, if the application finds the blob it is
looking for, it will either need to explicitly close it (because it was opened internally by the Blobstore) or complete
walking the full list.
258 
259 ### The Super Blob
260 
261 The super blob is simply a single blob ID that can be stored as part of the global metadata to act
262 as sort of a "root" blob. The application may choose to use this blob to store any information that it needs or finds
263 relevant in understanding any kind of structure for what is on the Blobstore.
264 
265 ## Examples {#blob_pg_examples}
266 
267 There are multiple examples of Blobstore usage in the [repo](https://github.com/spdk/spdk):
268 
* **Hello World**: Actually named `hello_blob.c`, this is a very basic example of a single threaded application that
  does nothing more than demonstrate the very basic API. Although Blobstore is optimized for NVMe, this example uses
  a RAM disk (malloc) back-end so that it can be executed easily in any development environment. The malloc back-end
  is a `bdev` module, so this example uses not only the SPDK Framework but the `bdev` layer as well.

* **CLI**: The `blobcli.c` example is a command line utility intended to serve not only as example code but also as a test
  and development tool for Blobstore itself. It is also a simple single threaded application that relies on both the
  SPDK Framework and the `bdev` layer, but it offers multiple modes of operation to accomplish some real-world tasks. In
  command mode, it accepts single-shot commands, which can be a little time consuming if there are many commands to
  get through, as each one takes a few seconds waiting for DPDK initialization. It therefore has a shell mode that
  allows the developer to get to a `blob>` prompt and then very quickly interact with Blobstore with simple commands
  that include the ability to import/export blobs from/to regular files. Lastly, there is a scripting mode to automate
  a series of tasks, which is again handy for development and test activities.
282 
283 ## Configuration {#blob_pg_config}
284 
285 Blobstore configuration options are described in the initialization options section under @ref blob_pg_design.
286 
287 ## Component Detail {#blob_pg_component}
288 
289 The information in this section is not necessarily relevant to designing an application for use with Blobstore, but
290 understanding a little more about the internals may be interesting and is also included here for those wanting to
291 contribute to the Blobstore effort itself.
292 
293 ### Media Format
294 
295 The Blobstore owns the entire storage device. The device is divided into clusters starting from the beginning, such
296 that cluster 0 begins at the first logical block.
297 
298     LBA 0                                   LBA N
299     +-----------+-----------+-----+-----------+
300     | Cluster 0 | Cluster 1 | ... | Cluster N |
301     +-----------+-----------+-----+-----------+
302 
303 Cluster 0 is special and has the following format, where page 0 is the first page of the cluster:
304 
305     +--------+-------------------+
306     | Page 0 | Page 1 ... Page N |
307     +--------+-------------------+
308     | Super  |  Metadata Region  |
309     | Block  |                   |
310     +--------+-------------------+
311 
312 The super block is a single page located at the beginning of the partition. It contains basic information about
313 the Blobstore. The metadata region is the remainder of cluster 0 and may extend to additional clusters. Refer
314 to the latest source code for complete structural details of the super block and metadata region.
315 
316 Each blob is allocated a non-contiguous set of pages inside the metadata region for its metadata. These pages
317 form a linked list. The first page in the list will be written in place on update, while all other pages will
318 be written to fresh locations. This requires the backing device to support an atomic write size greater than
319 or equal to the page size to guarantee that the operation is atomic. See the section on atomicity for details.
320 
321 ### Blob cluster layout {#blob_pg_cluster_layout}
322 
Each blob is an ordered list of clusters, where the starting LBA of a cluster is called an extent. A blob can be
thin provisioned, in which case some of its clusters have no extent. When the first write to an unallocated
cluster occurs, a new extent is chosen. This information is stored in RAM and on disk.

There are two on-disk extent representations, selected by the `use_extent_table` option (default: true) supplied
when creating a blob.

* **use_extent_table=true**: The EXTENT_PAGE descriptor is not part of the linked list of pages. It contains extents
  that are not run-length encoded. Each extent page is referenced by the EXTENT_TABLE descriptor, which is serialized
  as part of the linked list of pages. The extent table run-length encodes all unallocated extent pages.
  Every new cluster allocation updates a single extent page if that extent page was previously allocated.
  Otherwise, it additionally incurs serializing the whole linked list of pages for the blob.

* **use_extent_table=false**: The EXTENT_RLE descriptor is serialized as part of the linked list of pages.
  Extents pointing to contiguous LBAs are run-length encoded, including unallocated extents, which are represented by 0.
  Every new cluster allocation incurs serializing the whole linked list of pages for the blob.
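
The run-length encoding idea behind the second representation can be sketched as follows. This is not the on-disk
EXTENT_RLE format: the `toy_extent_run` structure and `toy_rle_encode` function are hypothetical, and the sketch only
shows how contiguous extents and runs of unallocated clusters (start LBA 0) collapse into single entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run: a starting LBA and how many consecutive clusters it covers. */
struct toy_extent_run {
    uint64_t start_lba;      /* 0 means "unallocated" in this sketch */
    uint32_t num_clusters;
};

/*
 * Encode a blob's per-cluster starting LBAs into runs.
 * Returns the number of runs written, or -1 if out[] is too small.
 */
static int toy_rle_encode(const uint64_t *cluster_lba, size_t n,
                          uint32_t lba_per_cluster,
                          struct toy_extent_run *out, size_t max_runs)
{
    size_t runs = 0;

    assert(out != NULL && lba_per_cluster > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t lba = cluster_lba[i];

        if (runs > 0) {
            struct toy_extent_run *last = &out[runs - 1];
            int both_unallocated = (lba == 0 && last->start_lba == 0);
            int contiguous = (lba != 0 && last->start_lba != 0 &&
                              lba == last->start_lba +
                                     (uint64_t)last->num_clusters * lba_per_cluster);

            if (both_unallocated || contiguous) {
                last->num_clusters++;   /* extend the current run */
                continue;
            }
        }
        if (runs == max_runs) {
            return -1;
        }
        out[runs].start_lba = lba;
        out[runs].num_clusters = 1;
        runs++;
    }
    return (int)runs;
}
```

A blob whose clusters are mostly contiguous, or mostly unallocated, encodes into very few runs, while a heavily
fragmented blob approaches one run per cluster.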
339 
340 ### Sequences and Batches
341 
Internally Blobstore uses the concepts of sequences and batches to submit IO to the underlying device in either
a serial fashion or in parallel, respectively. Both are defined using the following structure:

~~~{.c}
struct spdk_bs_request_set;
~~~

These request sets are basically bookkeeping mechanisms to help Blobstore efficiently deal with related groups
of IO. They are an internal construct only and are pre-allocated on a per channel basis (channels were discussed
earlier). They are removed from a channel-associated linked list when the set (sequence or batch) is started and
then returned to the list when completed.
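
The pre-allocate, take and return pattern described above can be sketched with a simple free list. The `toy_`
structures below are hypothetical and carry none of the real bookkeeping; they only show why starting a set needs no
allocation on the IO path.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHANNEL_OPS 4

struct toy_request_set {
    struct toy_request_set *next;   /* link while on the free list */
    int id;
};

struct toy_channel {
    struct toy_request_set pool[TOY_CHANNEL_OPS];   /* pre-allocated up front */
    struct toy_request_set *free_list;
};

/* Put every pre-allocated set on the channel's free list. */
static void toy_channel_init(struct toy_channel *ch)
{
    ch->free_list = NULL;
    for (int i = TOY_CHANNEL_OPS - 1; i >= 0; i--) {
        ch->pool[i].id = i;
        ch->pool[i].next = ch->free_list;
        ch->free_list = &ch->pool[i];
    }
}

/* Starting a sequence or batch removes a set from the list (NULL if none left). */
static struct toy_request_set *toy_set_start(struct toy_channel *ch)
{
    struct toy_request_set *set = ch->free_list;

    if (set != NULL) {
        ch->free_list = set->next;
        set->next = NULL;
    }
    return set;
}

/* Completion returns the set to the list for reuse. */
static void toy_set_finish(struct toy_channel *ch, struct toy_request_set *set)
{
    assert(set != NULL);
    set->next = ch->free_list;
    ch->free_list = set;
}
```

When the list is empty, `toy_set_start()` returns NULL, which corresponds to a channel having reached its configured
maximum number of simultaneous operations.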
353 
354 ### Key Internal Structures
355 
`blobstore.h` contains many of the key structures for the internal workings of Blobstore. Only a few notable ones
are reviewed here. Note that `blobstore.h` is an internal header file; the header file for Blobstore that defines
the public API is `blob.h`.

~~~{.c}
struct spdk_blob
~~~
This is an in-memory data structure that contains key elements like the blob identifier, its current state and two
copies of the mutable metadata for the blob; one copy is the current metadata and the other is the last copy written
to disk.

~~~{.c}
struct spdk_blob_mut_data
~~~
This is a per blob structure, included in the `struct spdk_blob` struct that actually defines the blob itself. It has the
specific information on size and makeup of the blob (i.e. how many clusters are allocated for this blob and which ones).

~~~{.c}
struct spdk_blob_store
~~~
This is the main in-memory structure for the entire Blobstore. It defines the global on-disk metadata region and maintains
information relevant to the entire system, such as initialization options like cluster size.

~~~{.c}
struct spdk_bs_super_block
~~~
The super block is an on-disk structure that contains all of the relevant information that's in the in-memory Blobstore
structure just discussed, along with other elements one would expect to see here such as signature, version, checksum, etc.
384 
385 ### Code Layout and Common Conventions
386 
In general, `blobstore.c` is laid out with groups of related functions blocked together with descriptive comments. For
example,

~~~{.c}
/* START spdk_bs_md_delete_blob */
/* ... relevant functions to accomplish the deletion of a blob ... */
/* END spdk_bs_md_delete_blob */
~~~

And for the most part the following conventions are followed throughout:

* functions beginning with an underscore are called internally only
* functions or variables with the letters `cpl` are related to set or callback completions
399 * functions or variables with the letters `cpl` are related to set or callback completions
400