| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| aarch64/ | - | - | 308 | 234 |
| generic/ | - | - | 329 | 243 |
| riscv/ | - | - | 164 | 94 |
| x86_64/ | - | - | 648 | 508 |
| CMakeLists.txt | 12-Sep-2023 | 1.7 KiB | 100 | 91 |
| README.md | 23-Jan-2023 | 4 KiB | 98 | 75 |
| inline_bcmp.h | 12-Jul-2024 | 1.8 KiB | 46 | 29 |
| inline_bzero.h | 12-Jul-2024 | 1,020 | 30 | 14 |
| inline_memcmp.h | 06-Oct-2024 | 1.9 KiB | 47 | 30 |
| inline_memcpy.h | 06-Oct-2024 | 2.2 KiB | 52 | 36 |
| inline_memmem.h | 12-Jul-2024 | 1.5 KiB | 45 | 26 |
| inline_memmove.h | 12-Jul-2024 | 3 KiB | 73 | 55 |
| inline_memset.h | 06-Oct-2024 | 1.9 KiB | 48 | 31 |
| inline_strcmp.h | 12-Jul-2024 | 1.6 KiB | 45 | 27 |
| inline_strstr.h | 12-Jul-2024 | 1.1 KiB | 30 | 16 |
| op_aarch64.h | 12-Jul-2024 | 9.7 KiB | 272 | 207 |
| op_builtin.h | 12-Jul-2024 | 5.4 KiB | 157 | 112 |
| op_generic.h | 13-Nov-2024 | 21.9 KiB | 584 | 353 |
| op_riscv.h | 12-Jul-2024 | 3.4 KiB | 85 | 58 |
| op_x86.h | 25-Nov-2024 | 16.2 KiB | 322 | 233 |
| utils.h | 13-Nov-2024 | 14 KiB | 355 | 203 |

README.md

# The mem* framework

The framework handles the following mem* functions:
 - `memcpy`
 - `memmove`
 - `memset`
 - `bzero`
 - `bcmp`
 - `memcmp`

## Building blocks

These functions can be built out of a set of lower-level operations:
 - **`block`** : operates on a block of `SIZE` bytes.
 - **`tail`** : operates on the last `SIZE` bytes of the buffer (e.g., `[dst + count - SIZE, dst + count)`).
 - **`head_tail`** : operates on the first and last `SIZE` bytes. This is the same as calling `block` and `tail`; the two ranges may overlap, so it covers the whole buffer when `SIZE <= count <= 2 * SIZE`.
 - **`loop_and_tail`** : calls `block` in a loop to consume as many of the `count` bytes as possible, then handles the remaining bytes with a `tail` operation.

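To make the byte ranges concrete, here is a small sketch that records which half-open ranges (as offsets from `dst`) each operation touches instead of writing memory. `Coverage` and `Range` are hypothetical names used only for this illustration; they are not part of the framework. For example, with `SIZE = 4` and `count = 6`, `head_tail` touches `[0, 4)` and `[2, 6)`, so two bytes are covered twice.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper type: a half-open byte range [first, second),
// expressed as offsets from dst.
using Range = std::pair<std::size_t, std::size_t>;

template <std::size_t SIZE>
struct Coverage {
  // A block always covers exactly SIZE bytes starting at `offset`.
  static Range block(std::size_t offset) { return {offset, offset + SIZE}; }

  // The tail is the block that ends exactly at `count`.
  static Range tail(std::size_t count) {
    assert(count >= SIZE);
    return block(count - SIZE);
  }

  // First and last SIZE bytes; the two ranges overlap when count < 2 * SIZE.
  static std::vector<Range> head_tail(std::size_t count) {
    return {block(0), tail(count)};
  }

  // Whole blocks from the front, then one (possibly overlapping) tail.
  static std::vector<Range> loop_and_tail(std::size_t count) {
    assert(count >= SIZE);
    std::vector<Range> ranges;
    std::size_t offset = 0;
    do {
      ranges.push_back(block(offset));
      offset += SIZE;
    } while (offset < count - SIZE);
    ranges.push_back(tail(count));
    return ranges;
  }
};
```
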
As an illustration, let's take the example of a trivial `memset` implementation:

```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return Memset<1>::block(dst, value);
  if (count == 2) return Memset<2>::block(dst, value);
  if (count == 3) return Memset<3>::block(dst, value);
  if (count <= 8) return Memset<4>::head_tail(dst, value, count);   // Note that 0 to 4 bytes are written twice.
  if (count <= 16) return Memset<8>::head_tail(dst, value, count);  // Here 0 to 7 bytes are written twice.
  return Memset<16>::loop_and_tail(dst, value, count);
}
```

Now let's have a look at the `Memset` structure:

```C++
template <size_t Size>
struct Memset {
  static constexpr size_t SIZE = Size;

  // Sets the SIZE bytes starting at dst.
  LIBC_INLINE static void block(Ptr dst, uint8_t value) {
    // Implement me
  }

  // Sets the last SIZE bytes of the buffer. Requires count >= SIZE.
  LIBC_INLINE static void tail(Ptr dst, uint8_t value, size_t count) {
    block(dst + count - SIZE, value);
  }

  // Sets the first and last SIZE bytes. Requires SIZE <= count <= 2 * SIZE.
  LIBC_INLINE static void head_tail(Ptr dst, uint8_t value, size_t count) {
    block(dst, value);
    tail(dst, value, count);
  }

  // Requires count >= SIZE: `count - SIZE` is unsigned and would wrap otherwise.
  LIBC_INLINE static void loop_and_tail(Ptr dst, uint8_t value, size_t count) {
    size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    tail(dst, value, count);
  }
};
```

As you can see, `tail`, `head_tail` and `loop_and_tail` are higher-order functions that build on each other. Only `block` really needs to be implemented.
In earlier designs we implemented these higher-order functions as templated functions, but it turned out to be more readable to state the implementation explicitly.
**This design is useful because it provides customization points**. For instance, for `bcmp` on `aarch64` we can provide a better implementation of `head_tail` using vector reduction intrinsics.

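As a minimal, self-contained sketch of the idea that only `block` needs a real implementation, the version below fills in `block` with a plain byte loop and leaves the other operations as compositions. It makes several assumptions that are not taken from the framework: `Ptr` is taken to be `unsigned char *`, `LIBC_INLINE` is omitted, the `assert` calls document preconditions, and `sketch_memset` is a hypothetical name chosen to avoid clashing with the C library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a plain byte pointer stands in for the framework's Ptr type.
using Ptr = unsigned char *;

template <std::size_t Size>
struct Memset {
  static constexpr std::size_t SIZE = Size;

  // The only primitive. SIZE is a compile-time constant, so the compiler
  // is free to unroll or vectorize this loop.
  static void block(Ptr dst, std::uint8_t value) {
    for (std::size_t i = 0; i < SIZE; ++i) dst[i] = value;
  }

  static void tail(Ptr dst, std::uint8_t value, std::size_t count) {
    assert(count >= SIZE);
    block(dst + count - SIZE, value);
  }

  static void head_tail(Ptr dst, std::uint8_t value, std::size_t count) {
    assert(count >= SIZE && count <= 2 * SIZE);
    block(dst, value);
    tail(dst, value, count);
  }

  static void loop_and_tail(Ptr dst, std::uint8_t value, std::size_t count) {
    assert(count >= SIZE);
    std::size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    tail(dst, value, count);
  }
};

// Hypothetical entry point mirroring the size dispatch of the trivial memset.
void sketch_memset(Ptr dst, int value, std::size_t count) {
  const std::uint8_t v = static_cast<std::uint8_t>(value);
  if (count == 0) return;
  if (count == 1) return Memset<1>::block(dst, v);
  if (count == 2) return Memset<2>::block(dst, v);
  if (count == 3) return Memset<3>::block(dst, v);
  if (count <= 8) return Memset<4>::head_tail(dst, v, count);
  if (count <= 16) return Memset<8>::head_tail(dst, v, count);
  return Memset<16>::loop_and_tail(dst, v, count);
}
```

Swapping in a faster `block`, or overriding one of the composed operations for a given target, leaves the dispatch code untouched.
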
## Scoped specializations

We can have several specializations of the `Memset` structure. Depending on the target requirements we can use one or several scopes for the same implementation.

In the following example we use the `generic` implementation for the small sizes but use the `x86` implementation for the loop.
```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return generic::Memset<1>::block(dst, value);
  if (count == 2) return generic::Memset<2>::block(dst, value);
  if (count == 3) return generic::Memset<3>::block(dst, value);
  if (count <= 8) return generic::Memset<4>::head_tail(dst, value, count);
  if (count <= 16) return generic::Memset<8>::head_tail(dst, value, count);
  return x86::Memset<16>::loop_and_tail(dst, value, count);
}
```

### The `builtin` scope

Ultimately we would like the compiler to provide the code for the `block` function. For this we rely on dedicated builtins available in Clang (e.g., [`__builtin_memset_inline`](https://clang.llvm.org/docs/LanguageExtensions.html#guaranteed-inlined-memset)).

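A `block` built on that builtin could look like the sketch below. The `builtin_sketch` namespace, the `SKETCH_HAS_MEMSET_INLINE` macro and the byte-loop fallback are assumptions made for this illustration so that it also compiles where the builtin is unavailable; they do not describe how the framework itself handles other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// __has_builtin is itself an extension, so test for it before using it.
#if defined(__has_builtin)
#if __has_builtin(__builtin_memset_inline)
#define SKETCH_HAS_MEMSET_INLINE 1
#endif
#endif

namespace builtin_sketch {  // hypothetical namespace for this illustration

template <std::size_t Size>
struct Memset {
  static constexpr std::size_t SIZE = Size;

  static void block(unsigned char *dst, std::uint8_t value) {
#ifdef SKETCH_HAS_MEMSET_INLINE
    // Clang expands this inline and guarantees no call to memset is emitted.
    // The size argument must be a compile-time constant.
    __builtin_memset_inline(dst, value, SIZE);
#else
    // Fallback for compilers that lack the builtin.
    for (std::size_t i = 0; i < SIZE; ++i) dst[i] = value;
#endif
  }

  static void tail(unsigned char *dst, std::uint8_t value, std::size_t count) {
    assert(count >= SIZE);
    block(dst + count - SIZE, value);
  }
};

}  // namespace builtin_sketch
```

The guarantee that no library call is emitted matters here because the code being written is the library's own `memset`.
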
### The `generic` scope

In this scope we define pure C++ implementations using native integral types and Clang vector extensions.

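As a rough sketch of the "native integral types" half of that description, the block below replicates the fill byte across a 64-bit word and stores it eight bytes at a time. The `generic_sketch` namespace and the `splat` helper are hypothetical names, the restriction to multiples of 8 is a simplification, and `std::memcpy` is used only as a portable way to express an unaligned store in a hosted program; a libc implementation would presumably avoid calling into the library it is implementing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace generic_sketch {  // hypothetical namespace for this illustration

// Replicates `value` into every byte of a 64-bit word,
// e.g. 0xAB becomes 0xABABABABABABABAB.
inline std::uint64_t splat(std::uint8_t value) {
  return value * UINT64_C(0x0101010101010101);
}

template <std::size_t Size>
struct Memset {
  static constexpr std::size_t SIZE = Size;
  static_assert(Size % sizeof(std::uint64_t) == 0,
                "this sketch only handles multiples of 8 bytes");

  static void block(unsigned char *dst, std::uint8_t value) {
    const std::uint64_t word = splat(value);
    // One 8-byte store per iteration; memcpy keeps the store valid for
    // unaligned destinations and is typically lowered to a single move.
    for (std::size_t i = 0; i < SIZE; i += sizeof(word))
      std::memcpy(dst + i, &word, sizeof(word));
  }
};

}  // namespace generic_sketch
```
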
### The arch-specific scopes

Then come implementations that use features specific to an architecture or microarchitecture (e.g., `rep;movsb` for `x86` or `dc zva` for `aarch64`).

The purpose here is to rely on builtins as much as possible and fall back to `asm volatile` as a last resort.

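To illustrate the "last resort" case, here is a hedged sketch of wrapping `rep movsb` in `asm volatile` using GCC/Clang extended inline assembly. The `arch_sketch` namespace and the portable fallback branch are assumptions added so the example builds on other targets; the sketch also assumes non-overlapping buffers and a clear direction flag, which the common x86-64 calling conventions require on function entry.

```cpp
#include <cassert>
#include <cstddef>

namespace arch_sketch {  // hypothetical namespace for this illustration

// Copies `count` bytes from `src` to `dst` (buffers must not overlap).
inline void repmovsb(void *dst, const void *src, std::size_t count) {
  assert(count == 0 || (dst != nullptr && src != nullptr));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  // rep movsb copies RCX bytes from [RSI] to [RDI] and updates all three
  // registers, hence the read-write ("+") constraints. The "memory" clobber
  // tells the compiler that memory is read and written behind its back.
  asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(count) : : "memory");
#else
  // Portable fallback for targets without this instruction.
  unsigned char *d = static_cast<unsigned char *>(dst);
  const unsigned char *s = static_cast<const unsigned char *>(src);
  for (std::size_t i = 0; i < count; ++i) d[i] = s[i];
#endif
}

}  // namespace arch_sketch
```

Keeping the assembly inside one small function limits the non-portable surface to a single place that a builtin can later replace.
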