BOLT
====

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application’s code layout
based on an execution profile gathered by a sampling profiler, such as
the Linux ``perf`` tool. An overview of the ideas implemented in BOLT,
along with a discussion of its potential and current results, is
available in the `CGO’19
paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__.

Input Binary Requirements
-------------------------

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the
binaries should have an unstripped symbol table, and, to get maximum
performance gains, they should be linked with relocations
(``--emit-relocs`` or ``-q`` linker flag).

BOLT disassembles functions and reconstructs the control flow graph
(CFG) before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain
heuristics to accomplish it. These heuristics have been tested on code
generated by the Clang and GCC compilers. The main requirement for C/C++
code is not to rely on code layout properties, such as function pointer
deltas. Assembly code can be processed too. Requirements for it include
a clear separation of code and data, with data objects placed into
data sections/segments. If indirect jumps are used for intra-function
control transfer (e.g., jump tables), the code patterns should match
those generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the
``-freorder-blocks-and-partition`` compiler option. Since GCC 8 enables
this option by default, you have to explicitly disable it by adding the
``-fno-reorder-blocks-and-partition`` flag if you are compiling with
GCC 8 or above.
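
As an illustration of the note above, a build line for GCC 8 or later
might look like the following minimal sketch. The file name ``app.c``,
the ``-O2`` level, and the variable names are assumptions made for the
example; ``-Wl,--emit-relocs`` passes the linker flag described earlier
through the compiler driver. The sketch only prints the command so that
it can be inspected before being run:

::

    # Hypothetical single-file program; app.c and -O2 are placeholders.
    CFLAGS="-O2 -fno-reorder-blocks-and-partition"
    LDFLAGS="-Wl,--emit-relocs"
    BUILD_CMD="gcc $CFLAGS app.c -o app $LDFLAGS"
    # Print the command instead of running it.
    echo "$BUILD_CMD"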

NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM
and GCC compilers. It offers several benefits over the previous DWARF
v4. Currently, BOLT’s support for v5 is a work in progress. While
you will be able to optimize binaries produced by the latest compilers,
until the support is complete, you will not be able to update the debug
info with ``-update-debug-sections``. To temporarily work around the
issue, we recommend compiling binaries with the ``-gdwarf-4`` option,
which forces DWARF v4 output.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

Installation
------------

Docker Image
~~~~~~~~~~~~

You can build and use the Docker image containing BOLT using our `docker
file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT
manually using the steps below.

Manual Build
~~~~~~~~~~~~

BOLT heavily uses LLVM libraries, and by design, it is built as one of
the LLVM tools. The build process is not much different from a regular
LLVM build. The following instructions assume that you are running
under Linux.

Start by cloning the LLVM repository:

::

    > git clone https://github.com/llvm/llvm-project.git
    > mkdir build
    > cd build
    > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
    > ninja bolt

``llvm-bolt`` will be available under ``bin/``. Add this directory to
your path to ensure the rest of the commands in this tutorial work.
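
A minimal sketch of that last step, assuming the commands are run from
the ``build`` directory created above:

::

    # Assumes the current directory is the build directory created above.
    export PATH="$PWD/bin:$PATH"
    # Print where llvm-bolt resolves from, or note that it is missing.
    command -v llvm-bolt || echo "llvm-bolt not found in $PWD/bin"

The change only lasts for the current shell session; add the ``export``
line to your shell startup file to make it permanent.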

Optimizing BOLT’s Performance
-----------------------------

BOLT runs many internal passes in parallel. If you foresee heavy usage
of BOLT, you can improve the processing time by using a memory
allocation library with good support for concurrency. E.g., to
preload jemalloc:

::

    > sudo yum install jemalloc-devel
    > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

Or, if you would rather use tcmalloc:

::

    > sudo yum install gperftools-devel
    > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....

Usage
-----

For a complete practical guide to using BOLT, see `Optimizing Clang with
BOLT <docs/OptimizingClang.md>`__.

Step 0
~~~~~~

In order to allow BOLT to re-arrange functions (in addition to
re-arranging code within functions) in your program, it needs a little
help from the linker. Add ``--emit-relocs`` to the final link step of
your application. You can verify the presence of relocations by checking
for the ``.rela.text`` section in the binary. BOLT will also report if it
detects relocations while processing the binary.
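
One way to perform that check is with ``readelf`` from binutils. The
helper name ``has_relocs`` is made up for this sketch:

::

    # Succeeds if the ELF file given as $1 has a .rela.text section.
    # Assumes readelf (binutils) is installed.
    has_relocs() {
        readelf -S -W "$1" 2>/dev/null | grep -q '\.rela\.text'
    }

For example, ``has_relocs ./app && echo "relocations present"`` prints
the message only if the hypothetical binary ``./app`` contains a
``.rela.text`` section.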

Step 1: Collect Profile
~~~~~~~~~~~~~~~~~~~~~~~

This step is different for different kinds of executables. If you can
invoke your program to run on a representative input from a command
line, then see the **For Applications** section below. If your program
typically runs as a server/service, then skip to the **For Services**
section.

The version of the ``perf`` command used for the following steps has to
support the ``-F brstack`` option. We recommend using ``perf`` version 4.5
or later.

For Applications
^^^^^^^^^^^^^^^^

This assumes you can run your program from a command line with a typical
input. In this case, simply prepend the command line invocation with
``perf``:

::

    $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

For Services
^^^^^^^^^^^^

Once you get the service deployed and warmed up, it is time to collect
perf data with LBR (branch information). The exact perf command to use
will depend on the service. E.g., to collect the data for all processes
running on the server for the next 3 minutes, use:

::

    $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

Depending on the application, you may need more samples in your
profile. It’s hard to tell upfront what the sweet spot is for your
application. We recommend that the profile cover 1B instructions, as
reported by the BOLT ``-dyno-stats`` option. If you need to increase the
number of samples in the profile, you can either run the ``sleep``
command for longer or use the ``-F<N>`` option with ``perf`` to increase
the sampling frequency.
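
For instance, a longer collection window and a higher sampling frequency
could be combined as in the sketch below. The function name is
hypothetical, and the default values of 300 seconds and 5000 samples per
second are illustrative, not recommendations:

::

    # System-wide LBR collection with a configurable duration and
    # sampling frequency. Both defaults are illustrative values.
    collect_service_profile() {
        duration=${1:-300}   # seconds to sample
        freq=${2:-5000}      # requested samples per second (perf -F)
        perf record -e cycles:u -j any,u -a -F "$freq" -o perf.data -- sleep "$duration"
    }

Calling ``collect_service_profile 600 10000`` would sample for 10
minutes at a requested rate of 10000 samples per second.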

Note that for profile collection we recommend using cycle events and not
``BR_INST_RETIRED.*``. Empirically, we found that cycle events produce
better results.

If the collection of a profile with branches is not available, e.g.,
when you run on a VM or on hardware that does not support it, then you
can use only sample events, such as cycles. In this case, the quality of
the profile information would not be as good, and performance gains with
BOLT are expected to be lower.

With instrumentation
^^^^^^^^^^^^^^^^^^^^

If ``perf record`` is not available to you, you may collect a profile by
first instrumenting the binary with BOLT and then running it:

::

    $ llvm-bolt <executable> -instrument -o <instrumented-executable>

After you run ``<instrumented-executable>`` with the desired workload,
its BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you
can skip **Step 2**.
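
Put together, the instrumentation workflow might look like the sketch
below. The function name and the ``.inst`` and ``.bolt`` suffixes are
assumptions made for the example; the optimization flags are the ones
used in **Step 3**:

::

    # $1 is the path to the original executable (e.g. ./app); the
    # remaining arguments are passed to the instrumented run as the
    # workload.
    bolt_with_instrumentation() {
        exe=$1; shift
        # Instrument, run the workload, then optimize with the profile
        # that the instrumented binary wrote to /tmp/prof.fdata.
        llvm-bolt "$exe" -instrument -o "$exe.inst" &&
        "$exe.inst" "$@" &&
        llvm-bolt "$exe" -o "$exe.bolt" -data=/tmp/prof.fdata \
            -reorder-blocks=ext-tsp -reorder-functions=hfsort \
            -split-functions -split-all-cold -split-eh -dyno-stats
    }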

Run BOLT with the ``-help`` option and check the category “BOLT
instrumentation options” for a quick reference on instrumentation knobs.

Step 2: Convert Profile to BOLT Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOTE: you can skip this step and feed ``perf.data`` directly to BOLT
using the experimental ``-p perf.data`` option.

For this step, you will need the ``perf.data`` file collected in the
previous step and a copy of the binary that was running. The binary must
have its symbol table intact: either it is unstripped, or only its debug
info has been removed (i.e., running ``strip -g`` is okay).

Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``:

::

    $ perf2bolt -p perf.data -o perf.fdata <executable>

This command will aggregate branch data from ``perf.data`` and store it
in a format that is both more compact and more resilient to binary
modifications.

If the profile was collected without LBRs, you will need to add the
``-nl`` flag to the command line above.
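
A sketch of the non-LBR path, from collection to conversion, is shown
below. The function name is hypothetical; the commands are the ones from
the steps above with ``-j any,u`` dropped and ``-nl`` added:

::

    # Collect plain cycle samples (no -j, so no branch stacks), then
    # convert them with -nl. $1 is the executable; the remaining
    # arguments are passed to it.
    profile_without_lbr() {
        exe=$1; shift
        perf record -e cycles:u -o perf.data -- "$exe" "$@" &&
        perf2bolt -p perf.data -o perf.fdata -nl "$exe"
    }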

Step 3: Optimize with BOLT
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have ``perf.fdata`` ready, you can use it for optimizations
with BOLT. Assuming your environment is set up to include the right path,
execute ``llvm-bolt``:

::

    $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

If you need updated debug info, then add the
``-update-debug-sections`` option to the command above. The processing
time will be slightly longer.

For a full list of options, see the ``-help``/``-help-hidden`` output.

The input binary for this step does not have to match 100% the binary
used for profile collection in **Step 1**. This could happen when you
are doing active development, and the source code constantly changes,
yet you want to benefit from profile-guided optimizations. However,
since the binary is not precisely the same, the profile information
could become invalid or stale, and BOLT will report the number of
functions with a stale profile. The higher the number, the less
performance improvement should be expected. Thus, it is crucial to
update ``.fdata`` for release branches.

Multiple Profiles
-----------------

Suppose your application can run in different modes, and you can
generate a profile for each one of them. To generate a single
binary that can benefit all modes (assuming the profiles don’t
contradict each other), you can use the ``merge-fdata`` tool:

::

    $ merge-fdata *.fdata > combined.fdata

Use ``combined.fdata`` for **Step 3** above to generate a universally
optimized binary.
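
The two steps could be chained as in the following sketch. The function
name and the ``profiles/`` directory are assumptions made for the
example; keeping the per-mode profiles in their own directory prevents a
previously generated ``combined.fdata`` from being picked up by the
glob:

::

    # Merge every per-mode profile under profiles/ (hypothetical
    # directory), then optimize the executable given as $1.
    bolt_with_merged_profiles() {
        merge-fdata profiles/*.fdata > combined.fdata &&
        llvm-bolt "$1" -o "$1.bolt" -data=combined.fdata \
            -reorder-blocks=ext-tsp -reorder-functions=hfsort \
            -split-functions -split-all-cold -split-eh -dyno-stats
    }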

License
-------

BOLT is licensed under the `Apache License v2.0 with LLVM
Exceptions <./LICENSE.TXT>`__.