| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| cmake/modules/ | - | - | 36 | 29 |
| docs/ | - | - | 4,972 | 3,416 |
| include/bolt/ | - | - | 20,788 | 11,674 |
| lib/ | - | - | 62,480 | 48,134 |
| runtime/ | - | - | 3,153 | 2,392 |
| test/ | - | - | 101,144 | 96,467 |
| tools/ | - | - | 1,292 | 992 |
| unittests/ | - | - | 496 | 356 |
| utils/ | - | - | 1,134 | 853 |
| CMakeLists.txt | 17-Jan-2025 | 7.3 KiB | 153 | 131 |
| LICENSE.TXT | 11-Jan-2022 | 14.8 KiB | 280 | 229 |
| Maintainers.txt | 02-Dec-2024 | 855 | | |
| README.md | 09-Aug-2023 | 8.5 KiB | 207 | 156 |
README.md
# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated with
the Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC8 or above.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.
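
The requirements above can be turned into build flags roughly as follows. This is a minimal sketch, not an official recipe: the helper name `bolt_friendly_build_cmd`, the `-O2` level, and the file names `app.c` and `app` are placeholders chosen for illustration. It only prints the command line, passing `--emit-relocs` to the linker through the compiler driver with `-Wl,` and adding `-fno-reorder-blocks-and-partition` when the compiler name looks like GCC.

```shell
# Sketch only: print a compile-and-link command that keeps relocations
# and, for GCC, disables -freorder-blocks-and-partition.
# The helper name, -O2, and the file names are illustrative placeholders.
bolt_friendly_build_cmd() {
    cc="$1"   # compiler driver, e.g. gcc or clang
    src="$2"  # placeholder source file
    out="$3"  # placeholder output binary
    flags="-O2 -Wl,--emit-relocs"
    case "$cc" in
        # Only GCC-style driver names get the extra flag.
        *gcc*|*g++*) flags="$flags -fno-reorder-blocks-and-partition" ;;
    esac
    echo "$cc $flags -o $out $src"
}

# Print the command for a GCC build of a placeholder program.
bolt_friendly_build_cmd gcc app.c app
```

Printing the command instead of running it makes it easy to inspect the flags before adding them to a real build system.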

## Installation

### Docker Image

You can build and use the docker image containing BOLT using our [docker file](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by linking against one of the memory
allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then check
the **For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prepend the command line invocation with `perf`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes, use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by the BOLT `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically, we found it to produce better results.
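
As a rough illustration of the two knobs mentioned above (a longer `sleep` and a higher `-F<N>` sampling frequency), the sketch below prints a system-wide `perf record` command line for a given duration and frequency. The helper name `perf_record_cmd` and the values 600 and 5000 are assumptions made for this example, not recommended settings; suitable values depend on your workload.

```shell
# Sketch only: print a system-wide `perf record` command line with a
# configurable duration and sampling frequency. The helper name and the
# example values are illustrative assumptions, not BOLT recommendations.
perf_record_cmd() {
    seconds="$1"  # how long to sample, passed to `sleep`
    freq="$2"     # sampling frequency, passed to perf's -F option
    echo "perf record -e cycles:u -j any,u -F $freq -a -o perf.data -- sleep $seconds"
}

# Print a command that samples for 10 minutes at a frequency of 5000.
perf_record_cmd 600 5000
```

After collecting with the printed command, the `-dyno-stats` output from BOLT can be used to judge whether the profile is large enough.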

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With instrumentation

If perf record is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run instrumented-executable with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped, or should have a symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT.
Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, then add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.

For a full list of options see `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
multiple profiles for each one of them. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other) you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).