# Optimizing Linux Kernel with BOLT

## Introduction

Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve CPU utilization not only in user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains when applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to the kernel compiled with the highest-level optimizations, including PGO and LTO. The database spent ~40% of the time in the kernel and was quite sensitive to kernel performance.

BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.

While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.

BOLT has been successfully applied to and tested with several flavors of the x86-64 Linux kernel.

## QuickStart Guide

BOLT operates on a statically-linked kernel executable, a.k.a. the `vmlinux` binary. However, most Linux distributions use a compressed `vmlinuz` image for system booting. To use BOLT on the kernel, you must either repackage `vmlinuz` after BOLT optimizations or add steps for running BOLT into the kernel build and rebuild `vmlinuz`. Uncompressing `vmlinuz` and repackaging it with a new `vmlinux` binary is beyond the scope of this guide; at some point, we may add the capability to run BOLT directly on `vmlinuz`. Meanwhile, this guide focuses on steps for integrating BOLT into the kernel build process.

### Building the Kernel

After downloading the kernel sources and configuration for your distribution, you should be able to build `vmlinuz` using the `make bzImage` command. Ideally, the build should be a binary match for the kernel on the system you are about to optimize (the target system). The binary matching part is critical as BOLT performs profile matching and optimizations at the binary level. We recommend installing a freshly built kernel on the target system to avoid any discrepancies.

Note that the kernel build will produce several artifacts besides `bzImage`. The most important of them is the uncompressed `vmlinux` binary, which will be used in the next steps. Make sure to save this file.

Build and target systems should have the `perf` tool installed for collecting and processing profiles. If your build system differs from the target, make sure the `perf` versions are compatible. The build system should also have the latest BOLT binary and tools (`llvm-bolt`, `perf2bolt`).
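
Before moving on, it can help to confirm that the tools named above are actually on `PATH`. The snippet below is a minimal sketch, not part of BOLT: the `check_tools` helper name is hypothetical, and the tool list simply mirrors the tools mentioned in the previous paragraph.

```bash
# Hypothetical helper: report which of the given tools are missing from PATH.
# Returns 0 when every tool is found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# On the build system; the target system only needs perf.
check_tools perf llvm-bolt perf2bolt || echo "install the missing tools before continuing"
```

Note that this only checks for presence. Compatibility between the `perf` versions on the build and target systems still has to be verified separately.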

Once the target system boots with the freshly built kernel, start your workload, such as a database benchmark. While the system is under load, collect the kernel profile using `perf`:

```bash
$ sudo perf record -a -e cycles -j any,k -F 5000 -- sleep 600
```

Convert the `perf` profile into a format suitable for BOLT by passing the `vmlinux` binary to `perf2bolt`:

```bash
$ sudo chown $USER perf.data
$ perf2bolt -p perf.data -o perf.fdata vmlinux
```

Under a high load, `perf.data` should be several gigabytes in size, and you should expect the converted `perf.fdata` not to exceed 100 MB.

Profiles collected from multiple workloads can be joined into a single profile using the `merge-fdata` utility:

```bash
$ merge-fdata perf.1.fdata perf.2.fdata ... perf.<N>.fdata > perf.merged.fdata
```

Two changes are required for the kernel build. The first one is optional but highly recommended.
It introduces a BOLT-reserved space into the `vmlinux` code section:

```diff
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -139,6 +139,11 @@ SECTIONS
                STATIC_CALL_TEXT
                *(.gnu.warning)

+               /* Allocate space for BOLT */
+               __bolt_reserved_start = .;
+               . += 2048 * 1024;
+               __bolt_reserved_end = .;
+
 #ifdef CONFIG_RETPOLINE
                __indirect_thunk_start = .;
                *(.text.__x86.*)
```

The second patch adds a step that runs BOLT on the `vmlinux` binary:

```diff
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -340,5 +340,13 @@ if is_enabled CONFIG_KALLSYMS; then
        fi
 fi

+# Apply BOLT
+BOLT=llvm-bolt
+BOLT_PROFILE=perf.fdata
+BOLT_OPTS="--dyno-stats --eliminate-unreachable=0 --reorder-blocks=ext-tsp --simplify-conditional-tail-calls=0 --skip-funcs=__entry_text_start,irq_entries_start --split-functions"
+mv vmlinux vmlinux.pre-bolt
+echo BOLTing vmlinux
+${BOLT} vmlinux.pre-bolt -o vmlinux --data ${BOLT_PROFILE} ${BOLT_OPTS}
+
 # For fixdep
 echo "vmlinux: $0" > .vmlinux.d
```

If you skipped the first step or are running BOLT on a pre-built `vmlinux` binary, drop the `--split-functions` option.

## Performance Expectations

By improving the code layout, BOLT can boost the kernel's performance by up to 5% by reducing instruction cache misses and branch mispredictions. When measuring total system performance, you should scale this number according to the time your application spends in the kernel (excluding I/O time).

## Profile Quality

The timing and duration of the profiling may have a significant effect on the performance of the BOLTed kernel. If you don't know your workload well, it's recommended that you profile for the whole duration of the benchmark run. As longer times will result in larger `perf.data` files, you can lower the profiling frequency by providing a smaller value for the `-F` flag. E.g., to record the kernel profile for half an hour, use the following command:

```bash
$ sudo perf record -a -e cycles -j any,k -F 1000 -- sleep 1800
```

## BOLT Disassembly

BOLT annotates the disassembly with control-flow information and attaches Linux-specific metadata to the code.
To view annotated disassembly, run:

```bash
$ llvm-bolt vmlinux -o /dev/null --print-cfg
```

If you want to limit the disassembly to a set of functions, add `--print-only=<func1regex>,<func2regex>,...`, where a function name is specified using regular expressions.
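
As an illustration of the `--print-only` filter, the sketch below wraps the command above in a small shell function. The `bolt_print_funcs` name is hypothetical and not part of BOLT; it only combines the flags already shown in this section.

```bash
# Hypothetical wrapper: dump annotated CFGs for selected functions only.
#   $1 = path to the vmlinux binary
#   $2 = comma-separated list of function-name regular expressions
bolt_print_funcs() {
    # Quote the regex list so the shell does not expand characters such as `*`.
    llvm-bolt "$1" -o /dev/null --print-cfg --print-only="$2"
}
```

For example, `bolt_print_funcs vmlinux 'do_syscall_64,.*_page_fault' > cfg.txt` would write the annotated disassembly of the matching functions to `cfg.txt`. The regular expressions here are placeholders chosen for illustration; substitute the functions that matter for your workload.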