# Blobstore {#blob}

## Introduction

The blobstore is a persistent, power-fail safe block allocator designed to be
used as the local storage system backing a higher level storage service,
typically in lieu of a traditional filesystem. These higher level services can
be local databases or key/value stores (MySQL, RocksDB), dedicated appliances
(SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not
designed to be a general purpose filesystem, however, and it is intentionally
not POSIX compliant. To avoid confusion, no reference to files or objects is
made at all; the term 'blob' is used instead. The blobstore is designed to
allow asynchronous, uncached, parallel reads and writes to groups of blocks on
a block device, where those groups are called 'blobs'. Blobs are typically
large, measured in at least hundreds of kilobytes, and are always a multiple
of the underlying block size.

The blobstore is designed primarily to run on "next generation" media, which
means the device supports fast random reads _and_ writes, with no required
background garbage collection. However, in practice the design will run well
on NAND too. Absolutely no attempt will be made to make this efficient on
spinning media.

## Design Goals

The blobstore is intended to solve a number of problems that local databases
have when using traditional POSIX filesystems. These databases are assumed to
'own' the entire storage device, to not need to track access times, and to
require only a very simple directory hierarchy. These assumptions allow
significant design optimizations over a traditional POSIX filesystem and block
stack.

Asynchronous I/O can be an order of magnitude or more faster than synchronous
I/O, and so solutions like
[libaio](https://git.fedorahosted.org/cgit/libaio.git/) have become popular.
However, libaio is [not actually
asynchronous](http://www.scylladb.com/2016/02/09/qualifying-filesystems/) in
all cases. The blobstore will provide truly asynchronous operations in all
cases without any hidden locks or stalls.

With the advent of NVMe, storage devices now have a hardware interface that
allows for highly parallel I/O submission from many threads with no locks.
Unfortunately, placement of data on a device requires some central
coordination to avoid conflicts. The blobstore will separate operations that
require coordination from operations that do not, and allow users to
explicitly associate I/O with channels. Operations on different channels
happen in parallel, all the way down to the hardware, with no locks or
coordination.

As media access latency improves, strategies for in-memory caching are
changing and often the kernel page cache is a bottleneck. Many databases have
moved to opening files only in O_DIRECT mode, avoiding the page cache
entirely, and writing their own caching layer. With the introduction of next
generation media and its additional expected latency reductions, this strategy
will become far more prevalent. To support this, the blobstore will perform no
in-memory caching of data at all, essentially making all blob operations
conceptually equivalent to O_DIRECT. This means the blobstore has similar
restrictions to O_DIRECT, where data can only be read or written in units of
pages (4KiB), although the memory alignment requirements are much less strict
than O_DIRECT (the pages can even be composed of scattered buffers). We fully
expect that DRAM caching will remain critical to performance, but we leave the
specifics of the cache design to higher layers.

Storage devices pull data from host memory using a DMA engine, and those DMA
engines operate on physical addresses and often introduce alignment
restrictions. Further, to avoid data corruption, the data must not be paged
out by the operating system while it is being transferred to disk.
Traditionally, operating systems solve this problem either by copying user
data into special kernel buffers allocated for this purpose and performing the
I/O to and from there, or by taking locks to mark all user pages as pinned and
unmovable. Historically, the time to perform the copy or locking was
inconsequential relative to the I/O time at the storage device, but that is
simply no longer the case. The blobstore will instead provide zero copy,
lockless read and write access to the device. To do this, memory to be used
for blob data must be registered with the blobstore up front, preferably at
application start and out of the I/O path, so that it can be pinned, the
physical addresses can be determined, and the alignment requirements can be
verified.
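In practice, registration falls out of how the buffers are allocated. A
minimal sketch, assuming SPDK's environment layer (`spdk_dma_malloc()` from
`spdk/env.h`) as the allocator; the buffer size shown is illustrative:

    #include "spdk/env.h"

    /* Allocate a 64KiB data buffer once, at application start and out of the
     * I/O path. The environment layer hands back memory that is pinned (never
     * paged out) and whose physical address is known, so it can be given
     * directly to the device's DMA engine with no copies and no locks. */
    static void *
    alloc_blob_buffer(void)
    {
        /* 4KiB alignment satisfies typical DMA alignment restrictions. */
        return spdk_dma_malloc(64 * 1024, 4096, NULL);
    }

Buffers allocated this way are returned with `spdk_dma_free()` when no longer
needed.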
Hardware devices are necessarily limited to some maximum queue depth. For NVMe
devices that can be quite large (the spec allows up to 64K!), but it is
typically much smaller (128 - 1024 per queue). Under heavy load, databases may
generate enough requests to exceed the hardware queue depth, which requires
queueing in software. For operating systems this is often done in the generic
block layer and may cause unexpected stalls or require locks. The blobstore
will avoid this by simply failing requests with an appropriate error code when
the queue is full. This allows the blobstore to easily stick to its commitment
to never block, but may require the user to provide their own queueing layer.

The NVMe specification has added support for specifying priorities on the
hardware queues. With a traditional filesystem and storage stack, however,
there is no reasonable way to map an I/O from an arbitrary thread to a
particular hardware queue to be processed with the priority requested. The
blobstore solves this by allowing the user to create channels with priorities,
which map directly to priorities on NVMe hardware queues. The user can then
choose the priority for an I/O by sending it on the appropriate channel. This
is incredibly useful for many databases where data intake operations need to
run with a much higher priority than background scrub and compaction
operations in order to stay within quality of service requirements. Note that
many NVMe devices today do not yet support queue priorities, so the blobstore
considers this feature optional.

## The Basics

The blobstore defines a hierarchy of three units of disk space. The smallest
are the *logical blocks* exposed by the disk itself, which are numbered from 0
to N, where N is the number of blocks in the disk. A logical block is
typically either 512B or 4KiB.

The blobstore defines a *page* to be a fixed number of logical blocks defined
at blobstore creation time. The logical blocks that compose a page are
contiguous. Pages are also numbered from the beginning of the disk, such that
the first page worth of blocks is page 0, the second page is page 1, etc. A
page is typically 4KiB in size, so in practice a page is either 8 logical
blocks (512B blocks) or 1 logical block (4KiB blocks). The device must be able
to perform atomic reads and writes of at least the page size.

The largest unit is a *cluster*, which is a fixed number of pages defined at
blobstore creation time. The pages that compose a cluster are contiguous.
Clusters are also numbered from the beginning of the disk, where cluster 0 is
the first cluster worth of pages, cluster 1 is the second grouping of pages,
etc. A cluster is typically 1MiB in size, or 256 pages.

On top of these three basic units, the blobstore defines three primitives. The
most fundamental is the blob, where a blob is an ordered list of clusters plus
an identifier. Blobs persist across power failures and reboots. The set of all
blobs described by shared metadata is called the blobstore. I/O operations on
blobs are submitted through a channel. Channels are tied to threads, but
multiple threads can simultaneously submit I/O operations to the same blob on
their own channels.

Blobs are read and written in units of pages by specifying an offset in the
virtual blob address space. This offset is translated by first determining
which cluster(s) are being accessed, and then translating to a set of logical
blocks. This translation is done trivially using only basic math - there is no
mapping data structure. Unlike read and write, blobs are resized in units of
clusters.
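Because a blob is nothing more than an ordered list of clusters, the
translation is one division, one modulo, and one array lookup. A minimal
sketch of the math (the structure and field names here are hypothetical, not
part of the blobstore API):

    #include <stdint.h>

    /* Hypothetical in-memory view of a blob's cluster list. */
    struct blob_map {
        uint64_t *cluster_ids;       /* cluster_ids[i] = device cluster backing
                                      * virtual cluster i of the blob */
        uint64_t  num_clusters;
        uint64_t  pages_per_cluster; /* fixed at blobstore creation time */
        uint64_t  blocks_per_page;   /* page size / logical block size */
    };

    /* Translate a page offset in the blob's virtual address space to the
     * first logical block address (LBA) on the device. Pure arithmetic plus
     * one array lookup - no tree or hash table is consulted. */
    static int
    blob_page_to_lba(const struct blob_map *map, uint64_t page_offset,
                     uint64_t *lba)
    {
        uint64_t vcluster = page_offset / map->pages_per_cluster;
        uint64_t page_in_cluster = page_offset % map->pages_per_cluster;

        if (vcluster >= map->num_clusters) {
            return -1; /* offset beyond the blob's current size */
        }

        *lba = (map->cluster_ids[vcluster] * map->pages_per_cluster +
                page_in_cluster) * map->blocks_per_page;
        return 0;
    }

For the typical 1MiB cluster and 4KiB page, `pages_per_cluster` is 256 and
`blocks_per_page` is 8 or 1 depending on the logical block size.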
Blobs are described by their metadata, which consists of a discontiguous set
of pages stored in a reserved region on the disk. Each page of metadata is
referred to as a *metadata page*. Blobs do not share metadata pages with other
blobs, and in fact the design relies on the backing storage device supporting
an atomic write unit greater than or equal to the page size. Most devices
backed by NAND and next generation media support this atomic write capability,
but magnetic media often does not.

The metadata region is fixed in size and defined upon creation of the
blobstore. The size is configurable, but by default one page is allocated for
each cluster. For 1MiB clusters and 4KiB pages, that results in 0.4% metadata
overhead.

## Conventions

Data formats on the device are specified in [Backus-Naur
Form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form). All data is
stored on media in little-endian format. Unspecified data must be zeroed.

## Media Format

The blobstore owns the entire storage device. The device is divided into
clusters starting from the beginning, such that cluster 0 begins at the first
logical block.

    LBA 0                                     LBA N
    +-----------+-----------+-----+-----------+
    | Cluster 0 | Cluster 1 | ... | Cluster N |
    +-----------+-----------+-----+-----------+

Or in formal notation:

    <media-format> ::= <cluster0> <cluster>*

Cluster 0 is special and has the following format, where page 0
is the first page of the cluster:

    +--------+-------------------+
    | Page 0 | Page 1 ... Page N |
    +--------+-------------------+
    | Super  | Metadata Region   |
    | Block  |                   |
    +--------+-------------------+

Or formally:

    <cluster0> ::= <super-block> <metadata-region>

The super block is a single page located at the beginning of the partition.
It contains basic information about the blobstore. The metadata region
is the remainder of cluster 0 and may extend to additional clusters.

    <super-block> ::= <sb-version> <sb-len> <sb-super-blob> <sb-params>
                      <sb-md-start> <sb-md-len>
    <sb-version> ::= u32
    <sb-len> ::= u32 # Length of this super block, in bytes. Starts from the
                     # beginning of this structure.
    <sb-super-blob> ::= u64 # Special blobid set by the user that indicates
                            # where their starting metadata resides.

    <sb-md-start> ::= u64 # Metadata start location, in pages
    <sb-md-len> ::= u64 # Metadata length, in pages

The `<sb-params>` data contains parameters specified by the user when the blob
store was initially formatted.

    <sb-params> ::= <sb-page-size> <sb-cluster-size>
    <sb-page-size> ::= u32 # Page size, in bytes.
                           # Must be a multiple of the logical block size.
                           # The implementation today requires this to be 4KiB.
    <sb-cluster-size> ::= u32 # Cluster size, in bytes.
                              # Must be a multiple of the page size.
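Read back into C, the super block above might look like the following
structure. This is an illustrative mirror of the BNF, not the definition from
the blobstore sources; every field is stored little-endian on media:

    #include <stdint.h>

    /* On-media super block layout, packed so the compiler inserts no padding
     * between fields. */
    struct __attribute__((packed)) bs_super_block {
        uint32_t version;
        uint32_t length;       /* length of this super block, in bytes */
        uint64_t super_blob;   /* blobid of the user's starting metadata */

        /* <sb-params>: fixed when the blobstore is formatted */
        uint32_t page_size;    /* bytes; multiple of the logical block size */
        uint32_t cluster_size; /* bytes; multiple of the page size */

        uint64_t md_start;     /* metadata region start, in pages */
        uint64_t md_len;       /* metadata region length, in pages */
    };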
Each blob is allocated a non-contiguous set of pages inside the metadata
region for its metadata. These pages form a linked list. The first page in the
list will be written in place on update, while all other pages will be written
to fresh locations. This requires the backing device to support an atomic
write size greater than or equal to the page size to guarantee that the
operation is atomic. See the section on atomicity for details.

Each page is defined as:

    <metadata-page> ::= <blob-id> <blob-sequence-num> <blob-descriptor>*
                        <blob-next> <blob-crc>
    <blob-id> ::= u64 # The blob guid
    <blob-sequence-num> ::= u32 # The sequence number of this page in the
                                # linked list.

    <blob-descriptor> ::= <blob-descriptor-type> <blob-descriptor-length>
                          <blob-descriptor-data>
    <blob-descriptor-type> ::= u8 # 0 means padding, 1 means "extent", 2 means
                                  # xattr. The type describes how to interpret
                                  # the descriptor data.
    <blob-descriptor-length> ::= u32 # Length of the entire descriptor

    <blob-descriptor-data-padding> ::= u8

    <blob-descriptor-data-extent> ::= <extent-cluster-id> <extent-cluster-count>
    <extent-cluster-id> ::= u32 # The cluster id where this extent starts
    <extent-cluster-count> ::= u32 # The number of clusters in this extent

    <blob-descriptor-data-xattr> ::= <xattr-name-length> <xattr-value-length>
                                     <xattr-name> <xattr-value>
    <xattr-name-length> ::= u16
    <xattr-value-length> ::= u16
    <xattr-name> ::= u8*
    <xattr-value> ::= u8*

    <blob-next> ::= u32 # The offset into the metadata region that contains
                        # the next page of metadata. 0 means no next page.
    <blob-crc> ::= u32 # CRC of the entire page

Descriptors cannot span metadata pages.
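Because every descriptor carries its own type and total length, a reader can
walk a metadata page and simply skip descriptor types it does not understand.
A minimal sketch of such a walk (the function and constants are hypothetical;
the offsets come from the BNF above, and a little-endian host is assumed for
brevity):

    #include <stdint.h>
    #include <string.h>

    enum desc_type { DESC_PADDING = 0, DESC_EXTENT = 1, DESC_XATTR = 2 };

    /* Walk the <blob-descriptor>* region of one metadata page and count the
     * clusters referenced by its extent descriptors. 'buf'/'len' cover only
     * the descriptor area, not the page header or trailer. */
    static uint64_t
    count_clusters(const uint8_t *buf, size_t len)
    {
        uint64_t clusters = 0;
        size_t off = 0;

        while (off + 5 <= len) { /* type (u8) + length (u32) header */
            uint8_t type = buf[off];
            uint32_t desc_len;

            memcpy(&desc_len, buf + off + 1, sizeof(desc_len));
            if (type == DESC_PADDING || desc_len < 5 || desc_len > len - off) {
                break; /* padding (or a malformed length) ends the walk */
            }
            if (type == DESC_EXTENT && desc_len >= 13) {
                uint32_t count; /* <extent-cluster-count> */

                memcpy(&count, buf + off + 9, sizeof(count));
                clusters += count;
            }
            off += desc_len; /* length covers the entire descriptor */
        }
        return clusters;
    }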
## Atomicity

Metadata in the blobstore is cached and must be explicitly synced by the user.
Data is not cached, however, so when a write completes the data can be
considered durable if the metadata is synchronized. Metadata does not often
change, and in fact only must be synchronized after these explicit operations:

* resize
* set xattr
* remove xattr

Any other operation will not dirty the metadata. Further, the metadata for
each blob is independent of all of the others, so a synchronization operation
is only needed on the specific blob that is dirty (see the sketch at the end
of this section).

The metadata consists of a linked list of pages. Updates to the metadata are
done by first writing pages 2 through N to a new location, then writing page 1
in place to atomically update the chain, and finally erasing the remainder of
the old chain. The vast majority of the time, blobs consist of just a single
metadata page, so this operation is very efficient. For this scheme to work,
the write to the first page must be atomic, which requires hardware support
from the backing device. For most, if not all, NVMe SSDs, an atomic write unit
of 4KiB can be expected. Devices specify their atomic write unit in their NVMe
identify data - specifically in the AWUN field.
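A minimal sketch of the sync rule in code, assuming the
`spdk_blob_set_xattr()` and `spdk_blob_sync_md()` names from recent SPDK
releases (`spdk/blob.h`); the xattr name and value are illustrative:

    #include "spdk/blob.h"

    static void
    sync_complete(void *cb_arg, int bserrno)
    {
        /* bserrno == 0: this blob's metadata is now durable on media. */
    }

    static void
    tag_and_sync(struct spdk_blob *blob)
    {
        /* Setting an xattr dirties the blob's metadata in memory... */
        spdk_blob_set_xattr(blob, "name", "db-shard-0", sizeof("db-shard-0"));

        /* ...and it is not durable until the blob is explicitly synced.
         * The sync rewrites only this blob's metadata pages. */
        spdk_blob_sync_md(blob, sync_complete, NULL);
    }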