# The mem* framework

The framework handles the following mem* functions:
 - `memcpy`
 - `memmove`
 - `memset`
 - `bzero`
 - `bcmp`
 - `memcmp`

## Building blocks

These functions can be built out of a set of lower-level operations:
 - **`block`** : operates on a block of `SIZE` bytes.
 - **`tail`** : operates on the last `SIZE` bytes of the buffer (i.e., `[dst + count - SIZE, dst + count)`).
 - **`head_tail`** : operates on the first and last `SIZE` bytes. This is the same as calling `block` and `tail`.
 - **`loop_and_tail`** : calls `block` in a loop to consume as many of the `count` bytes as possible and handles the remaining bytes with a `tail` operation.

As an illustration, let's take the example of a trivial `memset` implementation:

```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return Memset<1>::block(dst, value);
  if (count == 2) return Memset<2>::block(dst, value);
  if (count == 3) return Memset<3>::block(dst, value);
  if (count <= 8) return Memset<4>::head_tail(dst, value, count); // Note that 0 to 4 bytes are written twice.
  if (count <= 16) return Memset<8>::head_tail(dst, value, count); // Likewise, 0 to 7 bytes are written twice.
  return Memset<16>::loop_and_tail(dst, value, count);
}
```

Now let's look at the `Memset` structure:

```C++
template <size_t Size>
struct Memset {
  static constexpr size_t SIZE = Size;

  LIBC_INLINE static void block(Ptr dst, uint8_t value) {
    // Implement me
  }

  LIBC_INLINE static void tail(Ptr dst, uint8_t value, size_t count) {
    block(dst + count - SIZE, value);
  }

  LIBC_INLINE static void head_tail(Ptr dst, uint8_t value, size_t count) {
    block(dst, value);
    tail(dst, value, count);
  }

  LIBC_INLINE static void loop_and_tail(Ptr dst, uint8_t value, size_t count) {
    size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    tail(dst, value, count);
  }
};
```

As you can see, `tail`, `head_tail` and `loop_and_tail` are higher-order functions that build on each other. Only `block` really needs to be implemented.
Earlier designs implemented these higher-order functions as templated functions, but stating the implementation explicitly turned out to be more readable.
**This design is useful because it provides customization points**. For instance, for `bcmp` on `aarch64` we can provide a better implementation of `head_tail` using vector reduction intrinsics.

## Scoped specializations

We can have several specializations of the `Memset` structure. Depending on the target requirements we can use one or several scopes for the same implementation.

In the following example we use the `generic` implementation for the small sizes but use the `x86` implementation for the loop.
```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return generic::Memset<1>::block(dst, value);
  if (count == 2) return generic::Memset<2>::block(dst, value);
  if (count == 3) return generic::Memset<3>::block(dst, value);
  if (count <= 8) return generic::Memset<4>::head_tail(dst, value, count);
  if (count <= 16) return generic::Memset<8>::head_tail(dst, value, count);
  return x86::Memset<16>::loop_and_tail(dst, value, count);
}
```

### The `builtin` scope

Ultimately we would like the compiler to provide the code for the `block` function. For this we rely on dedicated builtins available in Clang (e.g., [`__builtin_memset_inline`](https://clang.llvm.org/docs/LanguageExtensions.html#guaranteed-inlined-memset)).

### The `generic` scope

In this scope we define pure C++ implementations using native integral types and Clang vector extensions.

### The arch specific scopes

Then come implementations that use architecture- or microarchitecture-specific features (e.g., `rep;movsb` for `x86` or `dc zva` for `aarch64`).

The purpose here is to rely on builtins as much as possible and fall back to `asm volatile` as a last resort.
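
To make the building blocks concrete, here is a minimal, self-contained sketch of how they could fit together. It is not the framework's actual code: the `sketch` namespace, the `Ptr` alias, the `SKETCH_HAS_MEMSET_INLINE` macro, and the `memset_sketch` and `fills_exactly` helpers are all hypothetical names introduced for illustration. `block` uses Clang's `__builtin_memset_inline` when the compiler reports it and falls back to a plain byte loop otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Detect Clang's guaranteed-inlined memset builtin, if the compiler has it.
#if defined(__has_builtin)
#if __has_builtin(__builtin_memset_inline)
#define SKETCH_HAS_MEMSET_INLINE 1
#endif
#endif

namespace sketch {  // hypothetical namespace, not one of the framework's scopes

using std::size_t;
using std::uint8_t;
using Ptr = unsigned char *;  // assumption: stand-in for the framework's pointer type

template <size_t Size> struct Memset {
  static constexpr size_t SIZE = Size;

  // The only operation that really needs an implementation.
  static inline void block(Ptr dst, uint8_t value) {
#ifdef SKETCH_HAS_MEMSET_INLINE
    __builtin_memset_inline(dst, value, Size);
#else
    for (size_t i = 0; i < SIZE; ++i) dst[i] = value;  // portable fallback
#endif
  }

  // Writes the last SIZE bytes: [dst + count - SIZE, dst + count).
  static inline void tail(Ptr dst, uint8_t value, size_t count) {
    assert(count >= SIZE);
    block(dst + count - SIZE, value);
  }

  // Writes the first and last SIZE bytes; the two blocks may overlap.
  static inline void head_tail(Ptr dst, uint8_t value, size_t count) {
    block(dst, value);
    tail(dst, value, count);
  }

  // Writes whole blocks, then one final (possibly overlapping) tail block.
  static inline void loop_and_tail(Ptr dst, uint8_t value, size_t count) {
    assert(count >= SIZE);
    size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    tail(dst, value, count);
  }
};

// Size-based dispatch mirroring the trivial memset example above.
inline void memset_sketch(Ptr dst, uint8_t value, size_t count) {
  if (count == 0) return;
  if (count == 1) return Memset<1>::block(dst, value);
  if (count == 2) return Memset<2>::block(dst, value);
  if (count == 3) return Memset<3>::block(dst, value);
  if (count <= 8) return Memset<4>::head_tail(dst, value, count);
  if (count <= 16) return Memset<8>::head_tail(dst, value, count);
  return Memset<16>::loop_and_tail(dst, value, count);
}

// Fills `count` bytes in the middle of a padded buffer and reports whether
// exactly those bytes, and no others, were written.
inline bool fills_exactly(size_t count) {
  constexpr size_t kPad = 32;
  std::vector<unsigned char> buf(count + 2 * kPad, 0x00);
  memset_sketch(buf.data() + kPad, 0xAB, count);
  for (size_t i = 0; i < buf.size(); ++i) {
    const bool inside = i >= kPad && i < kPad + count;
    if (buf[i] != (inside ? 0xAB : 0x00)) return false;
  }
  return true;
}

}  // namespace sketch
```

Calling `sketch::fills_exactly(n)` over a range of sizes exercises every branch of the dispatch, including the overlapping writes done by `head_tail` and the final `tail` of `loop_and_tail`. The padding on both sides of the buffer is there to catch any write that strays outside `[dst, dst + count)`.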