# Direct Memory Access (DMA) From User Space {#memory}

The following is an attempt to explain why all data buffers passed to SPDK must
be allocated using `spdk_dma_malloc()` or its siblings, and why SPDK relies on
DPDK's proven base functionality to implement memory management. (Note: DPDK
mbufs are also safe to use in applications combining SPDK and DPDK
functionality.)

Computing platforms generally carve physical memory up into 4KiB segments
called pages. They number the pages from 0 to N starting from the beginning of
addressable memory. Operating systems then overlay 4KiB virtual memory pages on
top of these physical pages using arbitrarily complex mappings. See
[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview.

Physical memory is attached via memory channels, where each channel provides
some fixed amount of bandwidth. To optimize total memory bandwidth, the
physical addressing is often set up to automatically interleave between
channels. For instance, page 0 may be located on channel 0, page 1 on channel
1, page 2 on channel 2, etc., so that writing to memory sequentially
automatically utilizes all available channels. In practice, interleaving is
done at a much more granular level than a full page.

Modern computing platforms support hardware acceleration for virtual to
physical address translation inside their Memory Management Unit (MMU). The MMU
often supports multiple page sizes. On recent x86_64 systems, 4KiB, 2MiB, and
1GiB pages are supported. Typically, operating systems use 4KiB pages by
default.

NVMe devices transfer data to and from system memory using Direct Memory Access
(DMA). Specifically, they send messages across the PCI bus requesting data
transfers. In the absence of an IOMMU, these messages contain *physical* memory
addresses. These data transfers happen without involving the CPU, and the MMU
is responsible for making access to memory coherent.

NVMe devices may also place additional requirements on the physical layout of
memory for these transfers. The NVMe 1.0 specification requires all physical
memory to be describable by what is called a *PRP list*. To be described by a
PRP list, memory must have the following properties:

* The memory is broken into physical 4KiB pages, which we'll call device pages.
* The first device page can be a partial page starting at any 4-byte aligned
  address. It may extend up to the end of the current physical page, but not
  beyond.
* If there is more than one device page, the first device page must end on a
  physical 4KiB page boundary.
* The last device page begins on a physical 4KiB page boundary, but is not
  required to end on a physical 4KiB page boundary.

The specification allows for device pages to be sizes other than 4KiB, but all
known devices as of this writing use 4KiB.

The NVMe 1.1 specification added support for fully flexible scatter-gather
lists, but the feature is optional and most devices available today do not
support it.

User space drivers run in the context of a regular process and so have access
to virtual memory. In order to correctly program the device with physical
addresses, some method of address translation must be implemented.

The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from
within a process. This file contains the virtual address to physical address
mappings. As of Linux 4.0, accessing these mappings requires root privileges.
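
As an illustration, here is a minimal sketch of that lookup. It reads the
64-bit `/proc/self/pagemap` entry for a virtual page (page frame number in bits
0-54, "present" flag in bit 63, per the kernel's pagemap documentation). The
helper name `vtophys_pagemap` is purely illustrative; SPDK applications do not
need to do this by hand.

~~~{.c}
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Translate a virtual address in this process to a physical address by
 * reading /proc/self/pagemap. Each 8-byte entry describes one virtual page:
 * bits 0-54 hold the page frame number (PFN) and bit 63 is the "present"
 * flag. Requires root privileges on Linux 4.0 and later.
 */
static uint64_t
vtophys_pagemap(void *vaddr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t vfn = (uint64_t)(uintptr_t)vaddr / page_size;
	uint64_t entry;
	int fd;

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		return 0;
	}

	if (pread(fd, &entry, sizeof(entry), vfn * sizeof(entry)) != sizeof(entry)) {
		close(fd);
		return 0;
	}
	close(fd);

	if (!(entry & (1ULL << 63))) {
		/* Page is not currently present in physical memory. */
		return 0;
	}

	/* PFN is in bits 0-54; add the offset within the page. */
	return (entry & ((1ULL << 55) - 1)) * page_size +
	       ((uint64_t)(uintptr_t)vaddr & (page_size - 1));
}

int
main(void)
{
	void *buf = malloc(4096);

	/* Touch the page so it is actually mapped before the lookup. */
	*(volatile char *)buf = 1;

	printf("virtual %p -> physical 0x%" PRIx64 "\n", buf, vtophys_pagemap(buf));
	free(buf);
	return 0;
}
~~~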
However, operating systems make absolutely no guarantee that the mapping of
virtual to physical pages is static. The operating system has no visibility
into whether a PCI device is directly transferring data to a set of physical
addresses, so great care must be taken to coordinate DMA requests with page
movement. When an operating system flags a page such that the virtual to
physical address mapping cannot be modified, this is called **pinning** the
page.

There are several reasons why the virtual to physical mappings may change. By
far the most common is the operating system swapping pages out to disk. The
operating system also moves pages during memory compaction, and it may merge
identical pages onto a single physical page to save memory. Some operating
systems are also capable of transparent memory compression. It is also
increasingly possible to hot-add additional memory, which may trigger a
physical address rebalance to optimize interleaving.

POSIX provides the `mlock` call that forces a virtual page of memory to always
be backed by a physical page. In effect, this disables swapping. It does
*not* guarantee, however, that the virtual to physical address mapping is
static. The `mlock` call should not be confused with a **pin** call, and it
turns out that POSIX does not define an API for pinning memory. Therefore, the
mechanism to allocate pinned memory is operating system specific.

SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by
allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages
differently than regular 4KiB pages. Specifically, the operating system will
never change their physical location. This is not by intent, and so things
could change in future versions, but it is true today and has been for a number
of years (see the later section on the IOMMU for a future-proof solution).

With this explanation, hopefully it is now clear why all data buffers passed to
SPDK must be allocated using `spdk_dma_malloc()` or its siblings. The buffers
must be allocated specifically so that they are pinned and so that their
physical addresses are known.
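
As a concrete illustration, the sketch below allocates a pinned, DMA-safe
buffer with `spdk_dma_zmalloc()` and looks up its physical address with
`spdk_vtophys()` from `spdk/env.h`. The application name `"dma_example"` is
arbitrary, and the exact signatures (particularly `spdk_vtophys()`) have varied
across SPDK releases, so treat this as a sketch rather than canonical usage.

~~~{.c}
#include <inttypes.h>
#include <stdio.h>

#include "spdk/env.h"

int
main(void)
{
	struct spdk_env_opts opts;
	void *buf;
	uint64_t paddr;

	/* Initialize the SPDK environment (reserves hugepage-backed memory). */
	spdk_env_opts_init(&opts);
	opts.name = "dma_example";
	if (spdk_env_init(&opts) < 0) {
		fprintf(stderr, "Unable to initialize the SPDK environment\n");
		return 1;
	}

	/*
	 * Allocate a zeroed, 4KiB-aligned buffer from pinned, hugepage-backed
	 * memory. A buffer like this is safe to hand to a device for DMA.
	 */
	buf = spdk_dma_zmalloc(4096, 4096, NULL);
	if (buf == NULL) {
		fprintf(stderr, "spdk_dma_zmalloc() failed\n");
		return 1;
	}

	/* Look up the physical address backing the buffer. */
	paddr = spdk_vtophys(buf, NULL);
	if (paddr == SPDK_VTOPHYS_ERROR) {
		fprintf(stderr, "spdk_vtophys() failed\n");
	} else {
		printf("virtual %p -> physical 0x%" PRIx64 "\n", buf, paddr);
	}

	spdk_dma_free(buf);
	return 0;
}
~~~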
## IOMMU Support

Many platforms contain an extra piece of hardware called an I/O Memory
Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it
provides virtualized address spaces to peripheral devices (i.e. on the PCI
bus). The MMU knows about virtual to physical mappings per process on the
system, so the IOMMU associates a particular device with one of these mappings
and then allows the user to assign arbitrary *bus addresses* to virtual
addresses in their process. All DMA operations between the PCI device and
system memory are then translated through the IOMMU by converting the bus
address to a virtual address and then the virtual address to the physical
address. This allows the operating system to freely modify the virtual to
physical address mapping without breaking ongoing DMA operations. Linux
provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU
with their current process.

This is a future-proof, hardware-accelerated solution for performing DMA
operations into and out of a user space process, and it forms the long-term
foundation for SPDK and DPDK's memory management strategy. We highly recommend
that applications be deployed with vfio and the IOMMU enabled, which is fully
supported today.
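
A common heuristic on Linux for checking whether the IOMMU is enabled is to
look for populated groups under `/sys/kernel/iommu_groups`. The short sketch
below (the helper `iommu_enabled` is purely illustrative, not an SPDK API)
shows that check.

~~~{.c}
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Heuristic check for an active IOMMU on Linux: when the kernel has an IOMMU
 * enabled, /sys/kernel/iommu_groups contains one directory per IOMMU group.
 * An empty or missing directory suggests the IOMMU is not available and DMA
 * must use pinned memory and raw physical addresses.
 */
static bool
iommu_enabled(void)
{
	DIR *dir;
	struct dirent *entry;
	bool found = false;

	dir = opendir("/sys/kernel/iommu_groups");
	if (dir == NULL) {
		return false;
	}

	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") != 0 &&
		    strcmp(entry->d_name, "..") != 0) {
			found = true;
			break;
		}
	}

	closedir(dir);
	return found;
}

int
main(void)
{
	printf("IOMMU appears to be %s\n", iommu_enabled() ? "enabled" : "disabled");
	return 0;
}
~~~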