# Optimizing Linux Kernel with BOLT


## Introduction

Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve CPU utilization not only in user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains when applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to a kernel compiled with the highest-level optimizations, including PGO and LTO. The database spent ~40% of its time in the kernel and was quite sensitive to kernel performance.

BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.

While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.

BOLT has been successfully applied to and tested with several flavors of the x86-64 Linux kernel.


## QuickStart Guide

BOLT operates on a statically linked kernel executable, a.k.a. the `vmlinux` binary. However, most Linux distributions use a compressed `vmlinuz` image for system booting. To use BOLT on the kernel, you must either repackage `vmlinuz` after BOLT optimizations or add steps for running BOLT into the kernel build and rebuild `vmlinuz`. Uncompressing `vmlinuz` and repackaging it with a new `vmlinux` binary is beyond the scope of this guide. At some point, we may add the capability to run BOLT directly on `vmlinuz`. Meanwhile, this guide focuses on the steps for integrating BOLT into the kernel build process.


### Building the Kernel

After downloading the kernel sources and configuration for your distribution, you should be able to build `vmlinuz` using the `make bzImage` command. Ideally, the kernel should be a binary match for the kernel on the system you are about to optimize (the target system). Binary matching is critical because BOLT performs profile matching and optimizations at the binary level. We recommend installing a freshly built kernel on the target system to avoid any discrepancies.

Note that the kernel build will produce several artifacts besides `bzImage`. The most important of them is the uncompressed `vmlinux` binary, which will be used in the next steps. Make sure to save this file.

Build and target systems should have the `perf` tool installed for collecting and processing profiles. If your build system differs from the target, make sure the `perf` versions are compatible. The build system should also have the latest BOLT binary and tools (`llvm-bolt`, `perf2bolt`).

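Before profiling, it can help to confirm that the tools named above are on `PATH`. The following is a minimal sketch; the `missing_tools` helper is a hypothetical name introduced for illustration and is not part of BOLT or `perf`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every argument that is not an executable on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The tool names are the ones this guide mentions.
missing=$(missing_tools perf llvm-bolt perf2bolt)
if [ -n "$missing" ]; then
  echo "not found on PATH:" $missing >&2
fi
```
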
Once the target system boots with the freshly built kernel, start your workload, such as a database benchmark. While the system is under load, collect the kernel profile using `perf`. The command below samples all CPUs (`-a`) for 600 seconds and records kernel-mode branches (`-j any,k`):


```bash
$ sudo perf record -a -e cycles -j any,k -F 5000 -- sleep 600
```


Convert the `perf` profile into a format suitable for BOLT by passing the `vmlinux` binary to `perf2bolt`:


```bash
$ sudo chown $USER perf.data
$ perf2bolt -p perf.data -o perf.fdata vmlinux
```


Under a high load, `perf.data` should be several gigabytes in size, and the converted `perf.fdata` should not exceed 100 MB.

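A quick way to compare your files against these expectations is to print their sizes. This is only a sketch: the `file_size_mb` helper is a hypothetical name, and the file names assume the defaults used in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the size of a file in whole mebibytes.
file_size_mb() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( bytes / 1024 / 1024 ))
}

# Report sizes only for the files that exist in the current directory.
for f in perf.data perf.fdata; do
  if [ -f "$f" ]; then
    echo "$f: $(file_size_mb "$f") MB"
  fi
done
```
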
Profiles collected from multiple workloads can be joined into a single profile using the `merge-fdata` utility:

```bash
$ merge-fdata perf.1.fdata perf.2.fdata ... perf.<N>.fdata > perf.merged.fdata
```

Two changes are required for the kernel build. The first one is optional but highly recommended. It introduces a BOLT-reserved space into the `vmlinux` code section:


```diff
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -139,6 +139,11 @@ SECTIONS
                STATIC_CALL_TEXT
                *(.gnu.warning)

+    /* Allocate space for BOLT */
+    __bolt_reserved_start = .;
+               . += 2048 * 1024;
+    __bolt_reserved_end = .;
+
 #ifdef CONFIG_RETPOLINE
                __indirect_thunk_start = .;
                *(.text.__x86.*)
```


The second patch adds a step that runs BOLT on the `vmlinux` binary:


```diff
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -340,5 +340,13 @@ if is_enabled CONFIG_KALLSYMS; then
        fi
 fi

+# Apply BOLT
+BOLT=llvm-bolt
+BOLT_PROFILE=perf.fdata
+BOLT_OPTS="--dyno-stats --eliminate-unreachable=0 --reorder-blocks=ext-tsp --simplify-conditional-tail-calls=0 --skip-funcs=__entry_text_start,irq_entries_start --split-functions"
+mv vmlinux vmlinux.pre-bolt
+echo BOLTing vmlinux
+${BOLT} vmlinux.pre-bolt -o vmlinux --data ${BOLT_PROFILE} ${BOLT_OPTS}
+
 # For fixdep
 echo "vmlinux: $0" > .vmlinux.d
```


If you skipped the first change (the BOLT-reserved space) or are running BOLT on a pre-built `vmlinux` binary, drop the `--split-functions` option.

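For a pre-built binary, the same invocation can be run by hand. The sketch below reuses the options from the `link-vmlinux.sh` patch minus `--split-functions`; the output name `vmlinux.bolt` and the `RUN` switch are assumptions made for illustration, and the command is only printed unless `RUN=1` is set:

```bash
#!/usr/bin/env bash
# Options copied from the link-vmlinux.sh patch, without --split-functions.
BOLT_OPTS=(
  --dyno-stats
  --eliminate-unreachable=0
  --reorder-blocks=ext-tsp
  --simplify-conditional-tail-calls=0
  --skip-funcs=__entry_text_start,irq_entries_start
)

# Input and output names are assumptions; adjust them to your setup.
cmd=(llvm-bolt vmlinux -o vmlinux.bolt --data perf.fdata "${BOLT_OPTS[@]}")

echo "${cmd[*]}"
# Dry run by default: set RUN=1 to actually invoke llvm-bolt.
if [ "${RUN:-0}" = 1 ]; then
  "${cmd[@]}"
fi
```
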

## Performance Expectations

By improving the code layout, BOLT can boost the kernel's performance by up to 5% through fewer instruction cache misses and branch mispredictions. When measuring total system performance, scale this number by the fraction of time your application spends in the kernel (excluding I/O time).

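As a rough first-order estimate, assuming the kernel speedup carries over linearly to the kernel's share of run time, the system-level gain is the kernel gain multiplied by the fraction of time spent in the kernel. The helper name `system_gain_pct` below is hypothetical; the inputs are the 5% upper bound from this section and the ~40% kernel time from the database example in the introduction:

```bash
#!/usr/bin/env bash
# system_gain_pct KERNEL_GAIN_PCT KERNEL_TIME_PCT
# Linear scaling is an assumption, not a guarantee.
system_gain_pct() {
  LC_ALL=C awk -v gain="$1" -v share="$2" \
    'BEGIN { printf "%.1f\n", gain * share / 100 }'
}

system_gain_pct 5 40   # prints 2.0
```

This estimate is consistent with the 2 percent TPS improvement reported in the introduction.
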

## Profile Quality

The timing and duration of the profiling may have a significant effect on the performance of the BOLTed kernel. If you don't know your workload well, we recommend profiling for the whole duration of the benchmark run. Because longer runs produce larger `perf.data` files, you can lower the profiling frequency by providing a smaller value for the `-F` flag. For example, to record the kernel profile for half an hour, use the following command:


```bash
$ sudo perf record -a -e cycles -j any,k -F 1000 -- sleep 1800
```

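To see how frequency and duration trade off, you can compare the sample budgets of the two `perf record` commands in this guide. With `-F`, `perf` aims for that many samples per second on each profiled CPU, so frequency times duration is a rough upper bound per CPU; idle CPUs will record fewer samples. The helper name `max_samples_per_cpu` is hypothetical:

```bash
#!/usr/bin/env bash
# max_samples_per_cpu FREQ_HZ DURATION_SECONDS
# Upper bound only: the actual count depends on how busy each CPU is.
max_samples_per_cpu() {
  echo $(( $1 * $2 ))
}

max_samples_per_cpu 5000 600    # QuickStart command: prints 3000000
max_samples_per_cpu 1000 1800   # half-hour command:  prints 1800000
```

Under this estimate, the half-hour run at `-F 1000` collects at most 1.8 million samples per CPU, fewer than the ten-minute run at `-F 5000`, while covering three times as much of the workload.
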


## BOLT Disassembly

BOLT annotates the disassembly with control-flow information and attaches Linux-specific metadata to the code. To view annotated disassembly, run:


```bash
$ llvm-bolt vmlinux -o /dev/null --print-cfg
```


If you want to limit the disassembly to a set of functions, add `--print-only=<func1regex>,<func2regex>,...`, where each function name is specified as a regular expression.
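
As a sketch, the invocation below limits the output to one exact name and one pattern. The function names are only illustrative assumptions about what your kernel contains, and `RUN` is a hypothetical switch of this example: the command is printed unless `RUN=1` is set.

```bash
#!/usr/bin/env bash
# Illustrative patterns; replace them with functions from your own kernel.
patterns=('do_syscall_64' 'ext4_.*')

# Join the patterns with commas, the separator --print-only uses.
print_only=$(IFS=,; echo "${patterns[*]}")

cmd=(llvm-bolt vmlinux -o /dev/null --print-cfg "--print-only=${print_only}")
echo "${cmd[*]}"
# Dry run by default: set RUN=1 to actually invoke llvm-bolt.
if [ "${RUN:-0}" = 1 ]; then
  "${cmd[@]}"
fi
```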