# Direct Memory Access (DMA) From User Space {#memory}

The following is an attempt to explain why all data buffers passed to SPDK must
be allocated using spdk_dma_malloc() or its siblings, and why SPDK relies on
DPDK's proven base functionality to implement memory management.

Computing platforms generally carve physical memory up into 4KiB segments
called pages. They number the pages from 0 to N starting from the beginning of
addressable memory. Operating systems then overlay 4KiB virtual memory pages on
top of these physical pages using arbitrarily complex mappings. See
[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview.
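
As a small illustration of this numbering, the page number and in-page offset
of any address can be computed with simple integer arithmetic (this snippet is
purely illustrative and not part of SPDK):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096ULL /* 4KiB */

int main(void)
{
    uint64_t addr = 0x12345678;

    /* Which 4KiB page the address falls in, and where within that page. */
    uint64_t page_number = addr / PAGE_SIZE;
    uint64_t page_offset = addr % PAGE_SIZE;

    printf("address 0x%" PRIx64 " -> page %" PRIu64 ", offset %" PRIu64 "\n",
           addr, page_number, page_offset);
    return 0;
}
```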

Physical memory is attached on channels, where each memory channel provides
some fixed amount of bandwidth. To optimize total memory bandwidth, the
physical addressing is often set up to automatically interleave between
channels. For instance, page 0 may be located on channel 0, page 1 on channel
1, page 2 on channel 2, etc. This is so that writing to memory sequentially
automatically utilizes all available channels. In practice, interleaving is
done at a much more granular level than a full page.

Modern computing platforms support hardware acceleration for virtual to
physical translation inside of their Memory Management Unit (MMU). The MMU
often supports multiple different page sizes. On recent x86_64 systems, 4KiB,
2MiB, and 1GiB pages are supported. Typically, operating systems use 4KiB pages
by default.

NVMe devices transfer data to and from system memory using Direct Memory Access
(DMA). Specifically, they send messages across the PCI bus requesting data
transfers. In the absence of an IOMMU, these messages contain *physical* memory
addresses. These data transfers happen without involving the CPU, and the MMU
is responsible for making access to memory coherent.

NVMe devices also may place additional requirements on the physical layout of
memory for these transfers. The NVMe 1.0 specification requires all physical
memory to be describable by what is called a *PRP list*. To be described by a
PRP list, memory must have the following properties:

* The memory is broken into physical 4KiB pages, which we'll call device pages.
* The first device page can be a partial page starting at any 4-byte aligned
  address. It may extend up to the end of the current physical page, but not
  beyond.
* If there is more than one device page, the first device page must end on a
  physical 4KiB page boundary.
* The last device page begins on a physical 4KiB page boundary, but is not
  required to end on a physical 4KiB page boundary.

The specification allows device pages to be sizes other than 4KiB, but all
known devices as of this writing use 4KiB.
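
To make these rules concrete, a driver might walk a buffer and emit one
physical address per device page roughly as follows. This is only a simplified
sketch: the `vtophys()` translation helper is hypothetical (in a real driver
the physical addresses come from the memory management machinery described in
the rest of this document), and error handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define DEVICE_PAGE_SIZE 4096ULL

/* Hypothetical helper returning the physical address backing a virtual address. */
extern uint64_t vtophys(void *vaddr);

/* Fill prp_list with one entry per device page covering [buf, buf + len).
 * Returns the number of entries written. Assumes the buffer already satisfies
 * the alignment rules listed above. */
static size_t
build_prp_list(void *buf, size_t len, uint64_t *prp_list, size_t max_entries)
{
    uint8_t *vaddr = buf;
    size_t n = 0;

    while (len > 0 && n < max_entries) {
        /* Only the first device page may start mid-page; every later
         * entry begins on a 4KiB boundary. */
        uint64_t offset = (uintptr_t)vaddr & (DEVICE_PAGE_SIZE - 1);
        uint64_t chunk = DEVICE_PAGE_SIZE - offset;

        if (chunk > len) {
            chunk = len;
        }

        prp_list[n++] = vtophys(vaddr);
        vaddr += chunk;
        len -= chunk;
    }

    return n;
}
```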

The NVMe 1.1 specification added support for fully flexible scatter gather lists,
but the feature is optional and most devices available today do not support it.

User space drivers run in the context of a regular process and so have access
to virtual memory. In order to correctly program the device with physical
addresses, some method for address translation must be implemented.

The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from
within a process. This file contains the virtual address to physical address
mappings. As of Linux 4.0, accessing these mappings requires root privileges.
However, operating systems make absolutely no guarantee that the mapping of
virtual to physical pages is static. The operating system has no visibility
into whether a PCI device is directly transferring data to a set of physical
addresses, so great care must be taken to coordinate DMA requests with page
movement. When an operating system flags a page such that the virtual to
physical address mapping cannot be modified, this is called **pinning** the
page.
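
For illustration only, the pagemap-based translation looks roughly like the
sketch below. Each virtual page has an 8-byte entry in `/proc/self/pagemap`;
bit 63 indicates the page is present and bits 0-54 hold the physical frame
number. As the rest of this section explains, nothing stops the kernel from
changing the mapping right after the lookup, so this alone is not safe for DMA.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address to a physical address via /proc/self/pagemap.
 * Returns 0 on failure. Requires root privileges on recent kernels. */
static uint64_t
vtophys_pagemap(void *vaddr)
{
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t virt = (uint64_t)(uintptr_t)vaddr;
    uint64_t entry;
    int fd;

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
        return 0;
    }

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    if (pread(fd, &entry, sizeof(entry),
              (off_t)(virt / page_size * sizeof(entry))) != sizeof(entry)) {
        close(fd);
        return 0;
    }
    close(fd);

    if (!(entry & (1ULL << 63))) {
        return 0; /* Page not present. */
    }

    /* Physical frame number * page size + offset within the page. */
    return (entry & ((1ULL << 55) - 1)) * page_size + (virt % page_size);
}
```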

There are several reasons why the virtual to physical mappings may change, too.
By far the most common reason is page swapping to disk. However, the operating
system also moves pages during compaction, which relocates in-use pages to
create contiguous regions of free physical memory, and features such as
same-page merging collapse identical pages onto a single physical page to save
memory. Some operating systems are also capable of doing transparent memory
compression. It is also increasingly possible to hot-add additional memory,
which may trigger a physical address rebalance to optimize interleaving.

POSIX provides the `mlock` call that forces a virtual page of memory to always
be backed by a physical page. In effect, this disables swapping for that page.
It does *not* guarantee, however, that the virtual to physical address mapping
is static. The `mlock` call should not be confused with a **pin** call, and it
turns out that POSIX does not define an API for pinning memory. Therefore, the
mechanism to allocate pinned memory is operating system specific.
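
For reference, the call in question looks like the following. It guarantees
only that the page stays resident in RAM; the kernel remains free to move the
data to a different physical page (illustrative snippet only):

```c
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    void *buf = aligned_alloc(4096, len);

    /* Prevents the page from being swapped out, but does NOT pin the
     * virtual to physical mapping. */
    if (buf == NULL || mlock(buf, len) != 0) {
        return 1;
    }

    munlock(buf, len);
    free(buf);
    return 0;
}
```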

SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by
allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages
differently than regular 4KiB pages. Specifically, the operating system will
never change their physical location. This is not guaranteed by design, so it
could change in future kernel versions, but it holds true today and has for a
number of years (see the later section on the IOMMU for a future-proof
solution).
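
Outside of DPDK, the same property can be obtained directly from the kernel by
mapping hugepages with `mmap`, which is a rough approximation of what the
hugepage-based allocator does underneath (simplified sketch; it assumes 2MiB
hugepages have already been reserved, e.g. via `vm.nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGEPAGE_SIZE (2ULL * 1024 * 1024) /* 2MiB */

int main(void)
{
    /* The backing 2MiB hugepage is never relocated by the kernel, so its
     * physical address stays valid for the lifetime of the mapping. */
    void *buf = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... use buf for DMA after translating its physical address ... */

    munmap(buf, HUGEPAGE_SIZE);
    return 0;
}
```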

With this explanation, hopefully it is now clear why all data buffers passed to
SPDK must be allocated using spdk_dma_malloc() or its siblings. The buffers
must be allocated specifically so that they are pinned and so that physical
addresses are known.
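
For example, after the SPDK environment has been initialized, an application
might allocate such a buffer as sketched below (see `include/spdk/env.h` for
the authoritative signatures):

```c
#include "spdk/env.h"

static void *
alloc_dma_buffer(void)
{
    uint64_t phys_addr;

    /* 4KiB zeroed buffer, 4KiB aligned, carved out of pinned hugepage
     * memory. The physical address is returned via the last argument. */
    void *buf = spdk_dma_zmalloc(0x1000, 0x1000, &phys_addr);

    if (buf == NULL) {
        return NULL;
    }

    /* buf may now be handed to SPDK I/O routines; free with spdk_dma_free(). */
    return buf;
}
```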

# IOMMU Support

Many platforms contain an extra piece of hardware called an I/O Memory
Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it
provides virtualized address spaces to peripheral devices (i.e. on the PCI
bus). The operating system maintains virtual to physical mappings for each
process on the system, so the IOMMU can associate a particular device with one
of these mappings and then allow the user to assign arbitrary *bus addresses*
to virtual addresses in their process. All DMA operations between the PCI
device and system memory are then translated through the IOMMU by converting
the bus address to a virtual address and then the virtual address to the
physical address. This allows the operating system to freely modify the virtual
to physical address mapping without breaking ongoing DMA operations. Linux
provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU
with their current process.

This is a future-proof, hardware-accelerated solution for performing DMA
operations into and out of a user space process and forms the long-term
foundation for SPDK and DPDK's memory management strategy. We highly recommend
that applications be deployed using vfio with the IOMMU enabled, which is fully
supported today.