1BOLT 2==== 3 4BOLT is a post-link optimizer developed to speed up large applications. 5It achieves the improvements by optimizing application’s code layout 6based on execution profile gathered by sampling profiler, such as Linux 7``perf`` tool. An overview of the ideas implemented in BOLT along with a 8discussion of its potential and current results is available in `CGO’19 9paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__. 10 11Input Binary Requirements 12------------------------- 13 14BOLT operates on X86-64 and AArch64 ELF binaries. At the minimum, the 15binaries should have an unstripped symbol table, and, to get maximum 16performance gains, they should be linked with relocations 17(``--emit-relocs`` or ``-q`` linker flag). 18 19BOLT disassembles functions and reconstructs the control flow graph 20(CFG) before it runs optimizations. Since this is a nontrivial task, 21especially when indirect branches are present, we rely on certain 22heuristics to accomplish it. These heuristics have been tested on a code 23generated with Clang and GCC compilers. The main requirement for C/C++ 24code is not to rely on code layout properties, such as function pointer 25deltas. Assembly code can be processed too. Requirements for it include 26a clear separation of code and data, with data objects being placed into 27data sections/segments. If indirect jumps are used for intra-function 28control transfer (e.g., jump tables), the code patterns should be 29matching those generated by Clang/GCC. 30 31NOTE: BOLT is currently incompatible with the 32``-freorder-blocks-and-partition`` compiler option. Since GCC8 enables 33this option by default, you have to explicitly disable it by adding 34``-fno-reorder-blocks-and-partition`` flag if you are compiling with 35GCC8 or above. 36 37NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM 38and GCC compilers. It offers several benefits over the previous DWARF 39v4. Currently, the support for v5 is a work in progress for BOLT. While 40you will be able to optimize binaries produced by the latest compilers, 41until the support is complete, you will not be able to update the debug 42info with ``-update-debug-sections``. To temporarily work around the 43issue, we recommend compiling binaries with ``-gdwarf-4`` option that 44forces DWARF v4 output. 45 46PIE and .so support has been added recently. Please report bugs if you 47encounter any issues. 48 49Installation 50------------ 51 52Docker Image 53~~~~~~~~~~~~ 54 55You can build and use the docker image containing BOLT using our `docker 56file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT 57manually using the steps below. 58 59Manual Build 60~~~~~~~~~~~~ 61 62BOLT heavily uses LLVM libraries, and by design, it is built as one of 63LLVM tools. The build process is not much different from a regular LLVM 64build. The following instructions are assuming that you are running 65under Linux. 66 67Start with cloning LLVM repo: 68 69:: 70 71 > git clone https://github.com/llvm/llvm-project.git 72 > mkdir build 73 > cd build 74 > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt" 75 > ninja bolt 76 77``llvm-bolt`` will be available under ``bin/``. Add this directory to 78your path to ensure the rest of the commands in this tutorial work. 79 80Optimizing BOLT’s Performance 81----------------------------- 82 83BOLT runs many internal passes in parallel. If you foresee heavy usage 84of BOLT, you can improve the processing time by linking against one of 85memory allocation libraries with good support for concurrency. E.g. to 86use jemalloc: 87 88:: 89 90 > sudo yum install jemalloc-devel 91 > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt .... 92 93Or if you rather use tcmalloc: 94 95:: 96 97 > sudo yum install gperftools-devel 98 > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt .... 99 100Usage 101----- 102 103For a complete practical guide of using BOLT see `Optimizing Clang with 104BOLT <docs/OptimizingClang.md>`__. 105 106Step 0 107~~~~~~ 108 109In order to allow BOLT to re-arrange functions (in addition to 110re-arranging code within functions) in your program, it needs a little 111help from the linker. Add ``--emit-relocs`` to the final link step of 112your application. You can verify the presence of relocations by checking 113for ``.rela.text`` section in the binary. BOLT will also report if it 114detects relocations while processing the binary. 115 116Step 1: Collect Profile 117~~~~~~~~~~~~~~~~~~~~~~~ 118 119This step is different for different kinds of executables. If you can 120invoke your program to run on a representative input from a command 121line, then check **For Applications** section below. If your program 122typically runs as a server/service, then skip to **For Services** 123section. 124 125The version of ``perf`` command used for the following steps has to 126support ``-F brstack`` option. We recommend using ``perf`` version 4.5 127or later. 128 129For Applications 130^^^^^^^^^^^^^^^^ 131 132This assumes you can run your program from a command line with a typical 133input. In this case, simply prepend the command line invocation with 134``perf``: 135 136:: 137 138 $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ... 139 140For Services 141^^^^^^^^^^^^ 142 143Once you get the service deployed and warmed-up, it is time to collect 144perf data with LBR (branch information). The exact perf command to use 145will depend on the service. E.g., to collect the data for all processes 146running on the server for the next 3 minutes use: 147 148:: 149 150 $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180 151 152Depending on the application, you may need more samples to be included 153with your profile. It’s hard to tell upfront what would be a sweet spot 154for your application. We recommend the profile to cover 1B instructions 155as reported by BOLT ``-dyno-stats`` option. If you need to increase the 156number of samples in the profile, you can either run the ``sleep`` 157command for longer and use ``-F<N>`` option with ``perf`` to increase 158sampling frequency. 159 160Note that for profile collection we recommend using cycle events and not 161``BR_INST_RETIRED.*``. Empirically we found it to produce better 162results. 163 164If the collection of a profile with branches is not available, e.g., 165when you run on a VM or on hardware that does not support it, then you 166can use only sample events, such as cycles. In this case, the quality of 167the profile information would not be as good, and performance gains with 168BOLT are expected to be lower. 169 170With instrumentation 171^^^^^^^^^^^^^^^^^^^^ 172 173If perf record is not available to you, you may collect profile by first 174instrumenting the binary with BOLT and then running it. 175 176:: 177 178 llvm-bolt <executable> -instrument -o <instrumented-executable> 179 180After you run instrumented-executable with the desired workload, its 181BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you can 182skip **Step 2**. 183 184Run BOLT with the ``-help`` option and check the category “BOLT 185instrumentation options” for a quick reference on instrumentation knobs. 186 187Step 2: Convert Profile to BOLT Format 188~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 189 190NOTE: you can skip this step and feed ``perf.data`` directly to BOLT 191using experimental ``-p perf.data`` option. 192 193For this step, you will need ``perf.data`` file collected from the 194previous step and a copy of the binary that was running. The binary has 195to be either unstripped, or should have a symbol table intact (i.e., 196running ``strip -g`` is okay). 197 198Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``: 199 200:: 201 202 $ perf2bolt -p perf.data -o perf.fdata <executable> 203 204This command will aggregate branch data from ``perf.data`` and store it 205in a format that is both more compact and more resilient to binary 206modifications. 207 208If the profile was collected without LBRs, you will need to add ``-nl`` 209flag to the command line above. 210 211Step 3: Optimize with BOLT 212~~~~~~~~~~~~~~~~~~~~~~~~~~ 213 214Once you have ``perf.fdata`` ready, you can use it for optimizations 215with BOLT. Assuming your environment is setup to include the right path, 216execute ``llvm-bolt``: 217 218:: 219 220 $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats 221 222If you do need an updated debug info, then add 223``-update-debug-sections`` option to the command above. The processing 224time will be slightly longer. 225 226For a full list of options see ``-help``/``-help-hidden`` output. 227 228The input binary for this step does not have to 100% match the binary 229used for profile collection in **Step 1**. This could happen when you 230are doing active development, and the source code constantly changes, 231yet you want to benefit from profile-guided optimizations. However, 232since the binary is not precisely the same, the profile information 233could become invalid or stale, and BOLT will report the number of 234functions with a stale profile. The higher the number, the less 235performance improvement should be expected. Thus, it is crucial to 236update ``.fdata`` for release branches. 237 238Multiple Profiles 239----------------- 240 241Suppose your application can run in different modes, and you can 242generate multiple profiles for each one of them. To generate a single 243binary that can benefit all modes (assuming the profiles don’t 244contradict each other) you can use ``merge-fdata`` tool: 245 246:: 247 248 $ merge-fdata *.fdata > combined.fdata 249 250Use ``combined.fdata`` for **Step 3** above to generate a universally 251optimized binary. 252 253License 254------- 255 256BOLT is licensed under the `Apache License v2.0 with LLVM 257Exceptions <./LICENSE.TXT>`__. 258