# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
the Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC 8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC 8 or above.

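As a rough sketch of how the requirements above could translate into a build, the snippet below collects the relevant flags and prints a compile command. The optimization level `-O2` and the file names `app.c` and `app` are assumptions for illustration; `-Wl,--emit-relocs` is the usual way to hand a linker flag to the linker through the compiler driver.

```shell
# Hypothetical build settings for a BOLT-friendly binary (GCC 8 or newer).
# Disable hot/cold block partitioning, which BOLT does not support.
CFLAGS="-O2 -fno-reorder-blocks-and-partition"
# Keep relocations in the output; -Wl, forwards the flag to the linker.
LDFLAGS="-Wl,--emit-relocs"

# Print the build command rather than running it; app.c and app are placeholders.
BUILD_CMD="gcc $CFLAGS app.c -o app $LDFLAGS"
echo "$BUILD_CMD"
```

Avoid stripping the resulting binary, since BOLT needs the symbol table.
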
PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

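One possible way to add the directory to your path for the current shell session is sketched below. `BOLT_BIN_DIR` is a hypothetical variable introduced here; its default assumes the snippet is run from the build directory.

```shell
# BOLT_BIN_DIR is a hypothetical variable; it defaults to bin/ under the
# current directory, which matches running this from the build directory.
BOLT_BIN_DIR="${BOLT_BIN_DIR:-$PWD/bin}"
export PATH="$BOLT_BIN_DIR:$PATH"

# Report whether the tool is now reachable.
if command -v llvm-bolt >/dev/null 2>&1; then
    echo "llvm-bolt found at $(command -v llvm-bolt)"
else
    echo "llvm-bolt not found; check that the build finished in $BOLT_BIN_DIR"
fi
```
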
## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by linking against one of the memory
allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

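Library locations differ between distributions, so a script may want to check which allocator is actually installed before preloading it. The following is a minimal sketch; the helper name `first_existing` is hypothetical, and the two candidate paths are simply the ones used in the commands above.

```shell
# first_existing: print the first argument that exists as a regular file.
first_existing() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate locations are assumptions; adjust them for your distribution.
ALLOC_LIB=$(first_existing \
    /usr/lib64/libjemalloc.so \
    /usr/lib64/libtcmalloc_minimal.so) || ALLOC_LIB=""

if [ -n "$ALLOC_LIB" ]; then
    echo "would run: LD_PRELOAD=$ALLOC_LIB llvm-bolt ..."
else
    echo "no preloadable allocator found; llvm-bolt would use the default malloc"
fi
```
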
## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for a `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.

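The check for `.rela.text` can be scripted with `readelf`, which lists section headers with `-S` (`-W` keeps long names on one line). The helper names `has_rela_text` and `check_relocs` below are hypothetical; this is a sketch of one way to do the check, not something BOLT itself provides.

```shell
# has_rela_text: succeed if the section listing on stdin mentions .rela.text.
has_rela_text() {
    grep -q '\.rela\.text'
}

# check_relocs: inspect a binary ($1) and report whether relocations were kept.
check_relocs() {
    if readelf -S -W "$1" | has_rela_text; then
        echo "$1: relocations present"
    else
        echo "$1: no .rela.text section; relink with --emit-relocs"
    fi
}
```

For example, `check_relocs ./app` (with `./app` standing in for your binary) prints one of the two messages.
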
### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes, use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

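To illustrate the two knobs just mentioned, the sketch below builds the system-wide `perf record` command from a duration and an optional sampling frequency, and prints it instead of running it. The helper name `perf_service_cmd` and the values 600 and 5000 are illustrative assumptions, not recommendations.

```shell
# perf_service_cmd: print a system-wide perf command.
#   $1 = duration in seconds
#   $2 = sampling frequency in Hz (optional; perf's default is used if omitted)
perf_service_cmd() {
    duration=$1
    freq=${2:-}
    cmd="perf record -e cycles:u -j any,u -a"
    if [ -n "$freq" ]; then
        cmd="$cmd -F $freq"
    fi
    echo "$cmd -o perf.data -- sleep $duration"
}

perf_service_cmd 180        # the 3-minute command from the text
perf_service_cmd 600 5000   # longer run with a higher sampling frequency
```
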
Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically, we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

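Put together, the instrumentation route might look like the sketch below. The helpers `run` and `instrument_and_profile`, the `DRY_RUN` variable, and the `.instrumented` and `.bolt` suffixes are all hypothetical names chosen for this example; the final command uses only the `-data` option and leaves out the optimization flags discussed in **Step 3**.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# instrument_and_profile: $1 = path to the input binary (e.g. ./app),
# remaining arguments = arguments for the profiling workload.
instrument_and_profile() {
    exe=$1
    shift
    # 1. Produce an instrumented copy of the binary.
    run llvm-bolt "$exe" -instrument -o "$exe.instrumented" || return 1
    # 2. Run the workload; the profile lands in /tmp/prof.fdata by default.
    run "$exe.instrumented" "$@" || return 1
    # 3. Feed the collected profile back to BOLT.
    run llvm-bolt "$exe" -o "$exe.bolt" -data=/tmp/prof.fdata || return 1
}
```

Running `DRY_RUN=1 instrument_and_profile ./app --bench` prints the three commands without executing them, which is a cheap way to review them first.
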
Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped, or should have a symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.

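The sketch below prints the collection and conversion commands side by side for the two cases. The helper names and the `lbr`/`nolbr` mode strings are hypothetical, and dropping `-j any,u` for the non-LBR case is an assumption based on the advice to use only sample events such as cycles.

```shell
# perf_record_cmd: print the collection command for mode $1.
#   "nolbr" = plain cycle samples; anything else = branch sampling as in Step 1
perf_record_cmd() {
    if [ "$1" = "nolbr" ]; then
        echo "perf record -e cycles:u -o perf.data -- <executable> <args>"
    else
        echo "perf record -e cycles:u -j any,u -o perf.data -- <executable> <args>"
    fi
}

# perf2bolt_cmd: print the matching conversion command; -nl marks a non-LBR profile.
perf2bolt_cmd() {
    if [ "$1" = "nolbr" ]; then
        echo "perf2bolt -p perf.data -o perf.fdata -nl <executable>"
    else
        echo "perf2bolt -p perf.data -o perf.fdata <executable>"
    fi
}

perf_record_cmd nolbr
perf2bolt_cmd nolbr
```
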
### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options, see the `-help`/`-help-hidden` output.

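For a command-line application, Steps 1 through 3 can be chained in a small script. The following is a sketch under the assumption that the program can be profiled with a single representative invocation; the helper names `run` and `bolt_optimize` and the `DRY_RUN` variable are hypothetical, while the tool options are the ones shown in the steps above.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# bolt_optimize: $1 = path to the binary (e.g. ./app), rest = workload arguments.
bolt_optimize() {
    exe=$1
    shift
    # Step 1: collect a profile with branch information.
    run perf record -e cycles:u -j any,u -o perf.data -- "$exe" "$@" || return 1
    # Step 2: convert the profile to BOLT format.
    run perf2bolt -p perf.data -o perf.fdata "$exe" || return 1
    # Step 3: write the optimized binary next to the original.
    run llvm-bolt "$exe" -o "$exe.bolt" -data=perf.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats || return 1
}
```

Calling `DRY_RUN=1 bolt_optimize ./app input.txt` lists the three commands without running them; each step stops the chain if it fails.
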
The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
a separate profile for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other), you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.

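When the merge is scripted, it can help to confirm that every per-mode profile exists and is non-empty before merging, so that a failed collection run does not silently drop a mode. The wrapper below is a sketch; `merge_profiles` is a hypothetical helper, and listing the inputs explicitly also avoids having a `*.fdata` glob pick up an earlier `combined.fdata`.

```shell
# merge_profiles: $1 = output file, remaining arguments = input .fdata profiles.
merge_profiles() {
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_profiles <output> <profile>..." >&2
        return 2
    fi
    out=$1
    shift
    # Refuse to merge if any input profile is missing or empty.
    for profile in "$@"; do
        if [ ! -s "$profile" ]; then
            echo "missing or empty profile: $profile" >&2
            return 1
        fi
    done
    merge-fdata "$@" > "$out"
}
```

A call such as `merge_profiles combined.fdata mode1.fdata mode2.fdata` (file names are placeholders) then produces the input for **Step 3**.
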
## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).