# Direct Memory Access (DMA) From User Space {#memory}

The following is an attempt to explain why all data buffers passed to SPDK must
be allocated using `spdk_dma_malloc()` or its siblings, and why SPDK relies on
DPDK's proven base functionality to implement memory management. (Note: DPDK
mbufs are also safe to use in applications combining SPDK and DPDK
functionality.)

Computing platforms generally carve physical memory up into 4KiB segments
called pages. They number the pages from 0 to N starting from the beginning of
addressable memory. Operating systems then overlay 4KiB virtual memory pages on
top of these physical pages using arbitrarily complex mappings. See
[Virtual Memory](https://en.wikipedia.org/wiki/Virtual_memory) for an overview.
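
For instance, with 4KiB pages, finding the page that contains an address is
simple arithmetic (a trivial sketch):

```c
#include <stdint.h>

/* With 4KiB (2^12 byte) pages, the page number of an address is the
 * address divided by the page size. */
static inline uint64_t
page_number(uint64_t addr)
{
    return addr >> 12; /* equivalent to addr / 4096 */
}
```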

Physical memory is attached to the CPU via memory channels, where each channel
provides some fixed amount of bandwidth. To optimize total memory bandwidth,
the physical addressing is often set up to automatically interleave between
channels. For instance, page 0 may be located on channel 0, page 1 on channel
1, page 2 on channel 2, etc., so that writing to memory sequentially
automatically utilizes all available channels. In practice, interleaving is
done at a much finer granularity than a full page.

Modern computing platforms support hardware acceleration for virtual to
physical translation inside of their Memory Management Unit (MMU). The MMU
often supports multiple different page sizes. On recent x86_64 systems, 4KiB,
2MiB, and 1GiB pages are supported. Typically, operating systems use 4KiB pages
by default.

NVMe devices transfer data to and from system memory using Direct Memory Access
(DMA). Specifically, they send messages across the PCI bus requesting data
transfers. In the absence of an IOMMU, these messages contain *physical* memory
addresses. These data transfers happen without involving the CPU, and the MMU
is responsible for making access to memory coherent.

NVMe devices may also place additional requirements on the physical layout of
memory for these transfers. The NVMe 1.0 specification requires all physical
memory to be describable by what is called a *PRP list*. To be described by a
PRP list, memory must have the following properties (see the sketch after this
list):

* The memory is broken into physical 4KiB pages, which we'll call device pages.
* The first device page can be a partial page starting at any 4-byte aligned
  address. It may extend up to the end of the current physical page, but not
  beyond.
* If there is more than one device page, the first device page must end on a
  physical 4KiB page boundary.
* The last device page begins on a physical 4KiB page boundary, but is not
  required to end on a physical 4KiB page boundary.
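
Below is a minimal sketch of building PRP entries for a virtually contiguous
buffer; the `vtophys()` helper is hypothetical, standing in for a real
translation mechanism such as SPDK's `spdk_vtophys()`:

```c
#include <stddef.h>
#include <stdint.h>

#define DEV_PAGE_SIZE 4096ULL
#define DEV_PAGE_MASK (DEV_PAGE_SIZE - 1)

/* Hypothetical virtual-to-physical helper, standing in for a real
 * mechanism such as SPDK's spdk_vtophys(). */
extern uint64_t vtophys(void *vaddr);

/*
 * Fill `prps` with one entry per device page covering `len` bytes at
 * `vaddr`, per the rules above. Returns the entry count, or -1 on
 * error. The first entry may point into the middle of a page; every
 * later entry is automatically 4KiB aligned because each chunk runs
 * to the end of its page.
 */
static int
build_prp_entries(void *vaddr, size_t len, uint64_t *prps, int max_prps)
{
    uintptr_t va = (uintptr_t)vaddr;
    int n = 0;

    if (va & 0x3) {
        return -1; /* first entry must be at least 4-byte aligned */
    }
    while (len > 0) {
        size_t chunk = DEV_PAGE_SIZE - (va & DEV_PAGE_MASK);

        if (chunk > len) {
            chunk = len;
        }
        if (n == max_prps) {
            return -1; /* PRP list is full */
        }
        prps[n++] = vtophys((void *)va);
        va += chunk;
        len -= chunk;
    }
    return n;
}
```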

The specification allows for device pages to be sizes other than 4KiB, but all
known devices as of this writing use 4KiB.

The NVMe 1.1 specification added support for fully flexible scatter-gather
lists, but the feature is optional and most devices available today do not
support it.

User space drivers run in the context of a regular process and so have access
to virtual memory. In order to correctly program the device with physical
addresses, some method for address translation must be implemented.

The simplest way to do this on Linux is to inspect `/proc/self/pagemap` from
within a process. This file contains the virtual address to physical address
mappings. As of Linux 4.0, accessing these mappings requires root privileges.
However, operating systems make absolutely no guarantee that the mapping of
virtual to physical pages is static. The operating system has no visibility
into whether a PCI device is directly transferring data to a set of physical
addresses, so great care must be taken to coordinate DMA requests with page
movement. When an operating system flags a page such that the virtual to
physical address mapping cannot be modified, this is called **pinning** the
page.
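
For illustration, here is a sketch of a pagemap-based lookup, subject to all
of the caveats above (the returned address may become stale at any time):

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Translate a virtual address to a physical address by reading
 * /proc/self/pagemap (one 64-bit entry per virtual page). Returns
 * UINT64_MAX on failure. Note that the result may be stale by the
 * time this function returns.
 */
static uint64_t
virt_to_phys(void *vaddr)
{
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t vfn = (uintptr_t)vaddr / page_size;
    uint64_t entry;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0) {
        return UINT64_MAX;
    }
    if (pread(fd, &entry, sizeof(entry), vfn * sizeof(entry)) !=
        (ssize_t)sizeof(entry)) {
        close(fd);
        return UINT64_MAX;
    }
    close(fd);
    if (!(entry & (1ULL << 63))) {
        return UINT64_MAX; /* bit 63: page present in RAM */
    }
    /* Bits 0-54 hold the physical frame number. */
    return (entry & ((1ULL << 55) - 1)) * page_size +
           ((uintptr_t)vaddr & (page_size - 1));
}
```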

There are several reasons why the virtual to physical mappings may change. By
far the most common is the swapping of pages to disk. However, the operating
system also moves pages during a process called compaction, which migrates
pages to create large contiguous regions of free physical memory, and some
kernels can collapse identical pages onto the same physical page to save
memory. Some operating systems are also capable of transparent memory
compression. It is also increasingly possible to hot-add additional memory,
which may trigger a physical address rebalance to optimize interleaving.

POSIX provides the `mlock` call that forces a virtual page of memory to always
be backed by a physical page. In effect, this disables swapping. It does
*not*, however, guarantee that the virtual to physical address mapping is
static. The `mlock` call should not be confused with a **pin** call, and it
turns out that POSIX does not define an API for pinning memory. Therefore, the
mechanism to allocate pinned memory is operating system specific.
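
A minimal sketch of what `mlock` does and does not provide:

```c
#include <stddef.h>
#include <sys/mman.h>

/*
 * Guarantee that a buffer stays resident in RAM (no swapping). The
 * kernel remains free to *move* the pages to different physical
 * addresses, so this provides residency, not pinning.
 */
static int
lock_buffer(void *buf, size_t len)
{
    return mlock(buf, len); /* 0 on success, -1 with errno set */
}
```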

SPDK relies on DPDK to allocate pinned memory. On Linux, DPDK does this by
allocating `hugepages` (by default, 2MiB). The Linux kernel treats hugepages
differently than regular 4KiB pages. Specifically, the operating system will
never change their physical location. This is not an explicit guarantee, and
so it could change in future versions, but it is true today and has been for a
number of years (see the later section on the IOMMU for a future-proof
solution).
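
At the system call level, the effect is roughly the following sketch (DPDK
actually maps files in hugetlbfs; this anonymous mapping assumes 2MiB
hugepages have already been reserved by the administrator):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
    /* Ask for one anonymous 2MiB hugepage. This fails unless
     * hugepages have been reserved, e.g. by writing to
     * /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages. */
    void *buf = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* buf is now backed by a single 2MiB page whose physical location
     * will not change in practice. */
    munmap(buf, 2 * 1024 * 1024);
    return 0;
}
```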

With this explanation, hopefully it is now clear why all data buffers passed to
SPDK must be allocated using `spdk_dma_malloc()` or its siblings. The buffers
must be allocated specifically so that they are pinned and so that physical
addresses are known.
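
For example, assuming the SPDK environment has already been initialized with
`spdk_env_init()`:

```c
#include "spdk/env.h"

/*
 * Allocate a zeroed, 4KiB-aligned, DMA-safe buffer: pinned memory
 * whose physical address can be discovered. The last argument is an
 * optional physical address out parameter and may be NULL.
 */
static void *
alloc_io_buffer(size_t size)
{
    return spdk_dma_zmalloc(size, 4096, NULL);
}

/* Release the buffer later with spdk_dma_free(). */
```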

## IOMMU Support

Many platforms contain an extra piece of hardware called an I/O Memory
Management Unit (IOMMU). An IOMMU is much like a regular MMU, except it
provides virtualized address spaces to peripheral devices (i.e. on the PCI
bus). The MMU knows about virtual to physical mappings per process on the
system, so the IOMMU associates a particular device with one of these mappings
and then allows the user to assign arbitrary *bus addresses* to virtual
addresses in their process. All DMA operations between the PCI device and
system memory are then translated through the IOMMU by converting the bus
address to a virtual address and then the virtual address to the physical
address. This allows the operating system to freely modify the virtual to
physical address mapping without breaking ongoing DMA operations. Linux
provides a device driver, `vfio-pci`, that allows a user to configure the IOMMU
with their current process.
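
For illustration, the core of that configuration is the `VFIO_IOMMU_MAP_DMA`
ioctl. A sketch, assuming a VFIO container that has already been set up
(group attachment and IOMMU type selection are omitted):

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Map `len` bytes of process memory at `vaddr` into the IOMMU so that
 * the device sees it at the bus address `iova`. The kernel pins the
 * pages, so the device's view stays valid even if the CPU's virtual
 * to physical mapping changes.
 */
static int
vfio_map_dma(int container_fd, void *vaddr, uint64_t iova, uint64_t len)
{
    struct vfio_iommu_type1_dma_map dma_map;

    memset(&dma_map, 0, sizeof(dma_map));
    dma_map.argsz = sizeof(dma_map);
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    dma_map.vaddr = (uintptr_t)vaddr;
    dma_map.iova = iova;
    dma_map.size = len;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}
```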

This is a future-proof, hardware-accelerated solution for performing DMA
operations into and out of a user space process, and it forms the long-term
foundation for SPDK and DPDK's memory management strategy. We highly recommend
deploying applications with vfio and the IOMMU enabled, which is fully
supported today.