# Blobstore {#blob}

## Introduction

The blobstore is a persistent, power-fail safe block allocator designed to be
used as the local storage system backing a higher level storage service,
typically in lieu of a traditional filesystem. These higher level services can
be local databases or key/value stores (MySQL, RocksDB), dedicated appliances
(SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not
designed to be a general purpose filesystem, however, and it is intentionally
not POSIX compliant. To avoid confusion, no reference to files or objects will
be made at all, instead using the term 'blob'. The blobstore is designed to
allow asynchronous, uncached, parallel reads and writes to groups of blocks on
a block device called 'blobs'. Blobs are typically large, measured in at least
hundreds of kilobytes, and are always a multiple of the underlying block size.

The blobstore is designed primarily to run on "next generation" media, which
means the device supports fast random reads _and_ writes, with no required
background garbage collection. However, in practice the design will run well on
NAND too. Absolutely no attempt will be made to make this efficient on spinning
media.

## Design Goals

The blobstore is intended to solve a number of problems that local databases
have when using traditional POSIX filesystems. These databases are assumed to
'own' the entire storage device, to not need to track access times, and to
require only a very simple directory hierarchy. These assumptions allow
significant design optimizations over a traditional POSIX filesystem and block
stack.

Asynchronous I/O can be an order of magnitude or more faster than synchronous
I/O, and so solutions like
[libaio](https://git.fedorahosted.org/cgit/libaio.git/) have become popular.
However, libaio is [not actually
asynchronous](http://www.scylladb.com/2016/02/09/qualifying-filesystems/) in
all cases. The blobstore will provide truly asynchronous operations in all
cases without any hidden locks or stalls.

With the advent of NVMe, storage devices now have a hardware interface that
allows for highly parallel I/O submission from many threads with no locks.
Unfortunately, placement of data on a device requires some central coordination
to avoid conflicts. The blobstore will separate operations that require
coordination from operations that do not, and allow users to explicitly
associate I/O with channels. Operations on different channels happen in
parallel, all the way down to the hardware, with no locks or coordination.
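
For illustration, the sketch below shows the per-thread channel pattern using
names from the public `include/spdk/blob.h` header (`spdk_bs_alloc_io_channel()`,
`spdk_blob_io_write()`, `spdk_bs_free_io_channel()`). Exact signatures vary
between SPDK releases, so treat this as a sketch of the pattern rather than
reference code:

    #include "spdk/blob.h"

    static void
    write_complete(void *cb_arg, int bserrno)
    {
        /* Completion is delivered on the same thread that submitted the I/O. */
    }

    /* Each thread allocates its own channel once, submits I/O only on that
     * channel, and frees it when the thread is done issuing I/O. Because the
     * channel is never shared, no locking is needed on the submission path. */
    static void
    thread_io(struct spdk_blob_store *bs, struct spdk_blob *blob,
              void *payload, uint64_t offset_pages, uint64_t num_pages)
    {
        struct spdk_io_channel *channel = spdk_bs_alloc_io_channel(bs);

        spdk_blob_io_write(blob, channel, payload, offset_pages, num_pages,
                           write_complete, NULL);

        /* ... issue any other I/O for this thread on the same channel ... */

        /* Free only after all I/O submitted on this channel has completed. */
        spdk_bs_free_io_channel(channel);
    }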

As media access latency improves, strategies for in-memory caching are changing
and often the kernel page cache is a bottleneck. Many databases have moved to
opening files only in O_DIRECT mode, avoiding the page cache entirely, and
writing their own caching layer. With the introduction of next generation media
and its additional expected latency reductions, this strategy will become far
more prevalent. To support this, the blobstore will perform no in-memory
caching of data at all, essentially making all blob operations conceptually
equivalent to O_DIRECT. This means the blobstore has similar restrictions to
O_DIRECT where data can only be read or written in units of pages (4KiB),
although memory alignment requirements are much less strict than O_DIRECT (the
pages can even be composed of scattered buffers). We fully expect that DRAM
caching will remain critical to performance, but leave the specifics of the
cache design to higher layers.
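
As a concrete illustration of the page-granular model and the scattered-buffer
allowance, the sketch below writes one 4KiB page supplied by two separate 2KiB
buffers using the vectored I/O call. The name `spdk_blob_io_writev()` comes
from the public blobstore header, and the 4KiB page size is an assumption of
this example, not a requirement stated by the API:

    #include <sys/uio.h>
    #include "spdk/blob.h"

    static void
    writev_complete(void *cb_arg, int bserrno)
    {
        /* The page contents are durable once this callback fires. */
    }

    /* Write exactly one page (offset and length are expressed in pages). The
     * page's 4096 bytes come from two scattered 2048-byte buffers, so no
     * contiguous page-sized allocation or intermediate copy is needed. */
    static void
    write_one_page(struct spdk_blob *blob, struct spdk_io_channel *channel,
                   void *first_half, void *second_half, uint64_t page_offset)
    {
        struct iovec iov[2] = {
            { .iov_base = first_half,  .iov_len = 2048 },
            { .iov_base = second_half, .iov_len = 2048 },
        };

        spdk_blob_io_writev(blob, channel, iov, 2, page_offset, 1,
                            writev_complete, NULL);
    }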

Storage devices pull data from host memory using a DMA engine, and those DMA
engines operate on physical addresses and often introduce alignment
restrictions. Further, to avoid data corruption, the data must not be paged out
by the operating system while it is being transferred to disk. Traditionally,
operating systems solve this problem either by copying user data into special
kernel buffers allocated for this purpose and performing the I/O to/from
there, or by locking the user pages in memory so they cannot be moved or paged
out. Historically, the time to perform the copy or locking was inconsequential
relative to the I/O time at the storage device, but that is simply no longer
the case. The blobstore will instead provide zero copy, lockless read and
write access to the device. To do this, memory to be used for blob data must
be registered with the blobstore up front, preferably at application start and
out of the I/O path, so that it can be pinned, the physical addresses can be
determined, and the alignment requirements can be verified.
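
The sketch below shows two ways such memory is typically obtained through
SPDK's environment layer: allocating pinned, DMA-safe memory up front, or
registering an existing application buffer. The functions (`spdk_dma_zmalloc()`,
`spdk_mem_register()`, `spdk_dma_free()`) come from `include/spdk/env.h`;
exact behavior and requirements vary by environment, so this is illustrative
only:

    #include "spdk/env.h"

    /* Allocate a 1 MiB, 4KiB-aligned buffer from pinned, DMA-safe memory at
     * application start, well outside the I/O path. */
    static void *
    alloc_blob_buffer(void)
    {
        return spdk_dma_zmalloc(1024 * 1024, 4096, NULL);
    }

    /* Alternatively, register an existing buffer so its physical addresses
     * can be resolved and it is treated as pinned for DMA. The region must
     * meet the environment's alignment and size requirements. */
    static int
    register_existing_buffer(void *buf, size_t len)
    {
        return spdk_mem_register(buf, len);
    }

    static void
    free_blob_buffer(void *buf)
    {
        spdk_dma_free(buf);
    }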

Hardware devices are necessarily limited to some maximum queue depth. For NVMe
devices the maximum can be quite large (the spec allows up to 64K!), but it is
typically much smaller (128 - 1024 per queue). Under heavy load, databases may
generate enough requests to exceed the hardware queue depth, which requires
queueing in software. For operating systems this is often done in the generic
block layer and may cause unexpected stalls or require locks. The blobstore
will avoid this by simply failing requests with an appropriate error code when
the queue is full. This allows the blobstore to easily stick to its commitment
to never block, but may require the user to provide their own queueing layer.
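
How that queueing layer looks is left to the application. The sketch below
shows one possible shape: failed requests are parked on a per-thread list and
resubmitted later, for example from a poller. The assumption that the failure
surfaces as `-ENOMEM` through the completion callback, and the `pending_write`
bookkeeping around it, are illustrative rather than part of the blobstore API:

    #include <errno.h>
    #include "spdk/blob.h"
    #include "spdk/queue.h"

    /* Hypothetical per-request bookkeeping kept by the application. */
    struct pending_write {
        struct spdk_blob *blob;
        struct spdk_io_channel *channel;
        void *payload;
        uint64_t offset;
        uint64_t length;
        TAILQ_ENTRY(pending_write) link;
    };

    static TAILQ_HEAD(, pending_write) g_deferred = TAILQ_HEAD_INITIALIZER(g_deferred);

    static void
    deferred_write_complete(void *cb_arg, int bserrno)
    {
        struct pending_write *req = cb_arg;

        if (bserrno == -ENOMEM) {
            /* Queue full: park the request and retry from a poller later. */
            TAILQ_INSERT_TAIL(&g_deferred, req, link);
            return;
        }
        /* Success or a hard error: release the request here. */
    }

    /* Called periodically (e.g. from a poller) to resubmit deferred writes. */
    static void
    retry_deferred_writes(void)
    {
        struct pending_write *req;

        while ((req = TAILQ_FIRST(&g_deferred)) != NULL) {
            TAILQ_REMOVE(&g_deferred, req, link);
            spdk_blob_io_write(req->blob, req->channel, req->payload,
                               req->offset, req->length,
                               deferred_write_complete, req);
        }
    }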

The NVMe specification has added support for specifying priorities on the
hardware queues. With a traditional filesystem and storage stack, however,
there is no reasonable way to map an I/O from an arbitrary thread to a
particular hardware queue to be processed with the priority requested. The
blobstore solves this by allowing the user to create channels with priorities,
which map directly to priorities on NVMe hardware queues. The user can then
choose the priority for an I/O by sending it on the appropriate channel. This
is incredibly useful for many databases where data intake operations need to
run with a much higher priority than background scrub and compaction operations
in order to stay within quality of service requirements. Note that many NVMe
devices today do not yet support queue priorities, so the blobstore considers
this feature optional.

## The Basics

The blobstore defines a hierarchy of three units of disk space. The smallest
are the *logical blocks* exposed by the disk itself, which are numbered from 0
to N-1, where N is the number of blocks in the disk. A logical block is
typically either 512B or 4KiB.

The blobstore defines a *page* to be a fixed number of logical blocks defined
at blobstore creation time. The logical blocks that compose a page are
contiguous. Pages are also numbered from the beginning of the disk such that
the first page worth of blocks is page 0, the second page is page 1, etc. A
page is typically 4KiB in size, so in practice a page is composed of either
eight 512B blocks or one 4KiB block. The device must be able to perform atomic
reads and writes of at least the page size.

The largest unit is a *cluster*, which is a fixed number of pages defined at
blobstore creation time. The pages that compose a cluster are contiguous.
Clusters are also numbered from the beginning of the disk, where cluster 0 is
the first cluster worth of pages, cluster 1 is the second grouping of pages,
etc. A cluster is typically 1MiB in size, or 256 pages.

On top of these three basic units, the blobstore defines three primitives. The
most fundamental is the blob, where a blob is an ordered list of clusters plus
an identifier. Blobs persist across power failures and reboots. The set of all
blobs described by shared metadata is called the blobstore. I/O operations on
blobs are submitted through a channel. Channels are tied to threads, but
multiple threads can simultaneously submit I/O operations to the same blob on
their own channels.
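
To make these primitives concrete, the sketch below walks the usual
callback-driven blob lifecycle: create a blob (which yields only an id), open
it to get a handle, and close the handle when done. The names come from
`include/spdk/blob.h` (`spdk_bs_create_blob()`, `spdk_bs_open_blob()`,
`spdk_blob_close()`), though signatures differ between SPDK releases:

    #include "spdk/blob.h"

    static void
    close_complete(void *cb_arg, int bserrno)
    {
        /* The handle is released; the blob itself still persists on disk. */
    }

    /* Step 3: the blob is open. I/O can now be submitted on any channel
     * belonging to this blobstore. When finished, close the handle. */
    static void
    open_complete(void *cb_arg, struct spdk_blob *blob, int bserrno)
    {
        if (bserrno != 0) {
            return;
        }
        /* ... submit I/O, resize, set xattrs, etc. ... */
        spdk_blob_close(blob, close_complete, NULL);
    }

    /* Step 2: creation returns only an id; open the blob to get a handle. */
    static void
    create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
    {
        struct spdk_blob_store *bs = cb_arg;

        spdk_bs_open_blob(bs, blobid, open_complete, NULL);
    }

    /* Step 1: ask the blobstore to create a new (initially empty) blob. */
    static void
    create_blob(struct spdk_blob_store *bs)
    {
        spdk_bs_create_blob(bs, create_complete, bs);
    }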

Blobs are read and written in units of pages by specifying an offset in the
virtual blob address space. This offset is translated by first determining
which cluster(s) are being accessed, and then translating to a set of logical
blocks. This translation is done trivially using only basic math - there is no
mapping data structure. Unlike read and write, blobs are resized in units of
clusters.
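
The translation can be illustrated with a few lines of arithmetic. The helper
below is purely hypothetical (the `cluster_table` array stands in for the
blob's ordered list of clusters); it is not part of the blobstore API:

    #include <stdint.h>

    /* Translate a page offset within a blob to the starting LBA on disk. */
    static uint64_t
    blob_page_to_lba(const uint64_t *cluster_table, uint64_t page_offset,
                     uint64_t pages_per_cluster, uint64_t blocks_per_page)
    {
        uint64_t cluster_index = page_offset / pages_per_cluster;
        uint64_t page_in_cluster = page_offset % pages_per_cluster;
        uint64_t cluster_start_lba =
            cluster_table[cluster_index] * pages_per_cluster * blocks_per_page;

        return cluster_start_lba + page_in_cluster * blocks_per_page;
    }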

Blobs are described by their metadata which consists of a discontiguous set of
pages stored in a reserved region on the disk. Each page of metadata is
referred to as a *metadata page*. Blobs do not share metadata pages with other
blobs, and in fact the design relies on the backing storage device supporting
an atomic write unit greater than or equal to the page size. Most devices
backed by NAND and next generation media support this atomic write capability,
but often magnetic media does not.

The metadata region is fixed in size and defined upon creation of the
blobstore. The size is configurable, but by default one page is allocated for
each cluster. For 1MiB clusters and 4KiB pages, that results in 0.4% metadata
overhead.

## Conventions

Data formats on the device are specified in [Backus-Naur
Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form). All data is
stored on media in little-endian format. Unspecified data must be zeroed.

## Media Format

The blobstore owns the entire storage device. The device is divided into
clusters starting from the beginning, such that cluster 0 begins at the first
logical block.

    LBA 0                                   LBA N
    +-----------+-----------+-----+-----------+
    | Cluster 0 | Cluster 1 | ... | Cluster N |
    +-----------+-----------+-----+-----------+

Or in formal notation:

    <media-format> ::= <cluster0> <cluster>*


Cluster 0 is special and has the following format, where page 0
is the first page of the cluster:

    +--------+-------------------+
    | Page 0 | Page 1 ... Page N |
    +--------+-------------------+
    | Super  |  Metadata Region  |
    | Block  |                   |
    +--------+-------------------+

Or formally:

    <cluster0> ::= <super-block> <metadata-region>

The super block is a single page located at the beginning of the partition.
It contains basic information about the blobstore. The metadata region
is the remainder of cluster 0 and may extend to additional clusters.

    <super-block> ::= <sb-version> <sb-len> <sb-super-blob> <sb-params>
                      <sb-metadata-start> <sb-metadata-len>
    <sb-version> ::= u32
    <sb-len> ::= u32 # Length of this super block, in bytes. Starts from the
                     # beginning of this structure.
    <sb-super-blob> ::= u64 # Special blobid set by the user that indicates where
                            # their starting metadata resides.

    <sb-metadata-start> ::= u64 # Metadata start location, in pages
    <sb-metadata-len> ::= u64 # Metadata length, in pages

The `<sb-params>` data contains parameters specified by the user when the blob
store was initially formatted.

    <sb-params> ::= <sb-page-size> <sb-cluster-size>
    <sb-page-size> ::= u32 # Page size, in bytes.
                           # Must be a multiple of the logical block size.
                           # The implementation today requires this to be 4KiB.
    <sb-cluster-size> ::= u32 # Cluster size, in bytes.
                              # Must be a multiple of the page size.
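
Read as a C structure, the super block grammar above corresponds roughly to
the sketch below. This is illustrative only: the structure actually written by
the implementation carries additional bookkeeping fields and explicit packing,
and all integers are stored little-endian on media:

    #include <stdint.h>

    /* Illustrative layout; field order follows the grammar above. A real
     * on-disk structure would also control padding explicitly. */
    struct super_block {
        uint32_t version;       /* <sb-version> */
        uint32_t length;        /* <sb-len>: length of the super block, in bytes */
        uint64_t super_blob;    /* <sb-super-blob>: user-designated blob id */
        /* <sb-params> */
        uint32_t page_size;     /* bytes; multiple of the logical block size */
        uint32_t cluster_size;  /* bytes; multiple of the page size */
        uint64_t md_start;      /* <sb-metadata-start>: in pages */
        uint64_t md_len;        /* <sb-metadata-len>: in pages */
    };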

Each blob is allocated a non-contiguous set of pages inside the metadata region
for its metadata. These pages form a linked list. The first page in the list
will be written in place on update, while all other pages will be written to
fresh locations. This requires the backing device to support an atomic write
size greater than or equal to the page size to guarantee that the operation is
atomic. See the section on atomicity for details.

Each page is defined as:

    <metadata-page> ::= <blob-id> <blob-sequence-num> <blob-descriptor>*
                        <blob-next> <blob-crc>
    <blob-id> ::= u64 # The blob guid
    <blob-sequence-num> ::= u32 # The sequence number of this page in the linked
                                # list.

    <blob-descriptor> ::= <blob-descriptor-type> <blob-descriptor-length>
                            <blob-descriptor-data>
    <blob-descriptor-type> ::= u8 # 0 means padding, 1 means "extent", 2 means
                                  # xattr. The type describes how to interpret
                                  # the descriptor data.
    <blob-descriptor-length> ::= u32 # Length of the entire descriptor

    <blob-descriptor-data-padding> ::= u8

    <blob-descriptor-data-extent> ::= <extent-cluster-id> <extent-cluster-count>
    <extent-cluster-id> ::= u32 # The cluster id where this extent starts
    <extent-cluster-count> ::= u32 # The number of clusters in this extent

    <blob-descriptor-data-xattr> ::= <xattr-name-length> <xattr-value-length>
                                     <xattr-name> <xattr-value>
    <xattr-name-length> ::= u16
    <xattr-value-length> ::= u16
    <xattr-name> ::= u8*
    <xattr-value> ::= u8*

    <blob-next> ::= u32 # The offset into the metadata region that contains the
                        # next page of metadata. 0 means no next page.
    <blob-crc> ::= u32 # CRC of the entire page


Descriptors cannot span metadata pages.
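
A reader of a metadata page walks the descriptor list until it reaches padding
or runs out of page. The sketch below is illustrative only: it assumes, per
the grammar above, that `<blob-descriptor-length>` covers the entire
descriptor, and it does not reflect the implementation's actual parser:

    #include <stdint.h>
    #include <string.h>

    enum descriptor_type {
        DESCRIPTOR_PADDING = 0,
        DESCRIPTOR_EXTENT  = 1,
        DESCRIPTOR_XATTR   = 2,
    };

    /* Walk the descriptors in one metadata page. 'buf' points at the first
     * descriptor and 'len' is the number of descriptor bytes in the page. */
    static void
    walk_descriptors(const uint8_t *buf, size_t len)
    {
        size_t off = 0;

        while (off + 5 <= len) {    /* type (u8) + length (u32) header */
            uint8_t type = buf[off];
            uint32_t desc_len;

            /* The length field is stored little-endian on media. */
            memcpy(&desc_len, &buf[off + 1], sizeof(desc_len));
            if (type == DESCRIPTOR_PADDING || desc_len == 0) {
                break;              /* remainder of the page is padding */
            }
            /* Type is extent or xattr: interpret the descriptor data here. */
            off += desc_len;
        }
    }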

## Atomicity

Metadata in the blobstore is cached and must be explicitly synced by the user.
Data is not cached, however, so when a write completes the data can be
considered durable if the metadata is synchronized. Metadata does not often
change, and in fact must only be synchronized after these explicit operations:

* resize
* set xattr
* remove xattr

Any other operation will not dirty the metadata. Further, the metadata for each
blob is independent of all of the others, so a synchronization operation is
only needed on the specific blob that is dirty.
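
In practice that means batching the dirtying operations and issuing a single
sync per blob, as in the sketch below. The calls (`spdk_blob_resize()`,
`spdk_blob_set_xattr()`, `spdk_blob_sync_md()`) are taken from
`include/spdk/blob.h`; exact signatures vary between SPDK releases:

    #include "spdk/blob.h"

    static void
    sync_complete(void *cb_arg, int bserrno)
    {
        /* Once this fires, both the resize and the xattr are durable. */
    }

    static void
    resize_complete(void *cb_arg, int bserrno)
    {
        struct spdk_blob *blob = cb_arg;

        /* Setting an xattr only dirties the cached metadata. */
        spdk_blob_set_xattr(blob, "name", "example", sizeof("example"));

        /* One sync persists every pending metadata change for this blob. */
        spdk_blob_sync_md(blob, sync_complete, NULL);
    }

    static void
    grow_and_label_blob(struct spdk_blob *blob)
    {
        /* Resizing (in clusters) dirties the metadata but does not persist it. */
        spdk_blob_resize(blob, 10, resize_complete, blob);
    }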

The metadata consists of a linked list of pages. Updates to the metadata are
done by first writing pages 2 through N to new locations, then writing page 1
in place to atomically update the chain, and finally erasing the remainder of
the old chain. The vast majority of the time, blobs consist of just a single
metadata page and so this operation is very efficient. For this scheme to work
the write to the first page must be atomic, which requires hardware support
from the backing device. For most, if not all, NVMe SSDs, an atomic write unit
of 4KiB can be expected. Devices specify their atomic write unit in their NVMe
identify data - specifically in the AWUN field.
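
As a sketch, an application driving an NVMe device through SPDK's NVMe driver
could check this field as shown below. `spdk_nvme_ctrlr_get_data()` returns
the cached identify controller data, and AWUN is a 0's based value expressed
in logical blocks:

    #include "spdk/nvme.h"

    /* Returns the controller's atomic write unit in logical blocks. AWUN is
     * 0's based in the identify data, so a raw value of 0 means one block. */
    static uint32_t
    atomic_write_unit_blocks(struct spdk_nvme_ctrlr *ctrlr)
    {
        const struct spdk_nvme_ctrlr_data *cdata = spdk_nvme_ctrlr_get_data(ctrlr);

        return (uint32_t)cdata->awun + 1;
    }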
278