==========================
Symbolizer Markup Format
==========================

.. contents::
   :local:

Overview
========

This document defines a text format for log messages that can be processed by a
symbolizing filter. The basic idea is that logging code emits text that contains
raw address values and so forth, without the logging code doing any real work to
convert those values to human-readable form. Instead, logging text uses the
markup format defined here to identify pieces of information that should be
converted to human-readable form after the fact. As with other markup formats,
the expectation is that most of the text will be displayed as is, while the
markup elements will be replaced with expanded text, or converted into active UI
elements, that present more details in symbolic form.

This means there is no need for symbol tables, DWARF debugging sections, or
similar information to be directly accessible at runtime. There is also no need
at runtime for any logic intended to compute human-readable presentation of
information, such as C++ symbol demangling. Instead, logging must include markup
elements that give the contextual information necessary to make sense of the raw
data, such as memory layout details.

This format identifies markup elements with a syntax that is both simple and
distinctive. It's simple enough to be matched and parsed with straightforward
code. It's distinctive enough that character sequences that look like the start
or end of a markup element should rarely if ever appear incidentally in logging
text. It's specifically intended not to require sanitizing plain text, such as
the HTML/XML requirement to replace ``<`` with ``&lt;`` and the like.

:doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>` includes a symbolizing
filter via its ``--filter-markup`` option.
Also, LLVM utilities emit stack
traces as markup when the ``LLVM_ENABLE_SYMBOLIZER_MARKUP`` environment
variable is set.

Scope and assumptions
=====================

A symbolizing filter implementation will be independent both of the target
operating system and machine architecture where the logs are generated and of
the host operating system and machine architecture where the filter runs.

This format assumes that the symbolizing filter processes intact whole lines. If
long lines might be split during some stage of a logging pipeline, they must be
reassembled to restore the original line breaks before feeding lines into the
symbolizing filter. Most markup elements must appear entirely on a single line
(often with other text before and/or after the markup element). There are some
markup elements that are specified to span lines, with line breaks in the middle
of the element. Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside certain
fields.

This format assumes that the symbolizing filter processes a coherent stream of
log lines from a single process address space context. If a logging stream
interleaves log lines from more than one process, these must be collated into
separate per-process log streams and each stream processed by a separate
instance of the symbolizing filter. Because the kernel and user processes use
disjoint address regions in most operating systems, a single user process
address space plus the kernel address space can be treated as a single address
space for symbolization purposes if desired.

Dependence on Build IDs
=======================

The symbolizer markup scheme relies on contextual information about runtime
memory address layout to make it possible to convert markup elements into useful
symbolic form.
This relies on having an unmistakable identification of which
binary was loaded at each address.

An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
(executable, shared library, loadable module, or driver module). The linker
generates this automatically based on a hash that includes the complete symbol
table and debugging information, even if this is later stripped from the binary.

This specification uses the ELF Build ID as the sole means of identifying
binaries. Each binary relevant to the log must have been linked with a unique
Build ID. The symbolizing filter must have some means of mapping a Build ID back
to the original ELF binary (either the whole unstripped binary, or a stripped
binary paired with a separate debug file).

Colorization
============

The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
Rendition) control sequences. These are unlike other markup elements:

* They specify presentation details (bold or colors) rather than semantic
  information. The association of semantic meaning with color (e.g. red for
  errors) is chosen by the code doing the logging, rather than by the UI
  presentation of the symbolizing filter. This is a concession to existing code
  (e.g. LLVM sanitizer runtimes) that uses specific colors and would require
  substantial changes to generate semantic markup instead.

* A single control sequence changes "the state", rather than being a
  hierarchical structure that surrounds affected text.

The filter processes ANSI SGR control sequences only within a single line. If a
control sequence to enter a bold or color state is encountered, it's expected
that the control sequence to reset to default state will be encountered before
the end of that line.
If a "dangling" state is left at the end of a line, the
filter may reset to default state for the next line.

An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and the
color/bold state is expected to apply to the symbolic output that replaces the
markup element in the filter's output.

The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
using C string syntax), where ``%u`` is one of these:

==== ============================ ===============================================
Code Effect                       Notes
==== ============================ ===============================================
0    Reset to default formatting.
1    Bold text                    Combines with color states, doesn't reset them.
30   Black foreground
31   Red foreground
32   Green foreground
33   Yellow foreground
34   Blue foreground
35   Magenta foreground
36   Cyan foreground
37   White foreground
==== ============================ ===============================================

Common markup element syntax
============================

All the markup elements share a common syntactic structure to facilitate simple
matching and parsing code. Each element has the form::

  {{{tag:fields}}}

``tag`` identifies one of the element types described below, and is always a
short alphabetic string that must be in lower case. The rest of the element
consists of one or more fields. Fields are separated by ``:`` and cannot contain
any ``:`` or ``}`` characters. How many fields must be or may be present and
what they contain is specified for each element type.

No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a field.

Implementations must ignore markup fields after those expected; this allows
adding new fields to backwards-compatibly extend elements.
Implementations need
not ignore them silently, but the element should behave otherwise as if the
fields were removed.

In the descriptions of each element type, ``printf``-style placeholders indicate
field contents:

``%s``
  A string of printable characters, not including ``:`` or ``}``.

``%p``
  An address value represented by ``0x`` followed by an even number of
  hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
  If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
  than 16 hexadecimal digits are expected to appear in a single value (64 bits).

``%u``
  A nonnegative decimal integer.

``%i``
  A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
  if prefixed by ``0``, or decimal otherwise.

``%x``
  A sequence of an even number of hexadecimal digits (using either lower-case or
  upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
  arbitrary sequence of bytes, such as an ELF Build ID.

Presentation elements
=====================

These are elements that convey a specific program entity to be displayed in
human-readable symbolic form.

``{{{symbol:%s}}}``
  Here ``%s`` is the linkage name for a symbol or type. It may require
  demangling according to language ABI rules. Even for unmangled names, it's
  recommended that this markup element be used to identify a symbol name so that
  it can be presented distinctively.

  Examples::

    {{{symbol:_ZN7Mangled4NameEv}}}
    {{{symbol:foobar}}}

``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}``

  Here ``%p`` is the memory address of a code location. It might be presented as a
  function name and source location. The latter two forms distinguish the kind of
  code location, as described in detail for ``bt`` elements below.

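The common element syntax and the single-line forms described above can be
matched with a short regular expression. The following is a minimal sketch in
Python, not the implementation used by any existing tool; ``ELEMENT_RE`` and
``parse_elements`` are hypothetical names, and elements that span lines are
deliberately not handled.

```python
import re

# Hypothetical pattern (an assumption, not taken from any tool): "{{{", a
# lower-case tag, zero or more ":"-prefixed fields that contain neither ":"
# nor "}", then "}}}".
ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


def parse_elements(line):
    """Return a (tag, fields) pair for each markup element found in a line."""
    elements = []
    for match in ELEMENT_RE.finditer(line):
        tag, rest = match.group(1), match.group(2)
        # rest is either "" or ":field1:field2..."; drop the leading ":".
        fields = rest[1:].split(":") if rest else []
        elements.append((tag, fields))
    return elements


print(parse_elements("fault at {{{pc:0x12345678:ra}}} in {{{symbol:foobar}}}"))
# [('pc', ['0x12345678', 'ra']), ('symbol', ['foobar'])]
```

Text outside the matches is left untouched, which mirrors the expectation that
most of a log line is displayed as is.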

  Examples::

    {{{pc:0x12345678}}}
    {{{pc:0xffffffff9abcdef0}}}

``{{{data:%p}}}``

  Here ``%p`` is the memory address of a data location. It might be presented as
  the name of a global variable at that location.

  Examples::

    {{{data:0x12345678}}}
    {{{data:0xffffffff9abcdef0}}}

``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}``

  This represents one frame in a backtrace. It usually appears on a line by
  itself (surrounded only by whitespace), in a sequence of such lines with
  ascending frame numbers. So the human-readable output might be formatted
  assuming that, such that it looks good for a sequence of bt elements each
  alone on its line with uniform indentation of each line. But it can appear
  anywhere, so the filter should not remove any non-whitespace text surrounding
  the element.

  Here ``%u`` is the frame number, which starts at zero for the location of the
  fault being identified, increments to one for the caller of frame zero's call
  frame, to two for the caller of frame one, etc. ``%p`` is the memory address
  of a code location.

  Code locations in a backtrace come from two distinct sources. Most backtrace
  frames describe a return address code location, i.e. the instruction
  immediately after a call instruction. This is the location of code that has
  yet to run, since the function called there has not yet returned. Hence the
  code location of actual interest is usually the call site itself rather than
  the return address, i.e. one instruction earlier.
  When presenting the source
  location for a return address frame, the symbolizing filter will subtract one
  byte or one instruction length from the actual return address for the call
  site, with the intent that the address logged can be translated directly to a
  source location for the call site and not for the apparent return site
  thereafter (which can be confusing). When inlined functions are involved, the
  call site and the return site can appear to be in different functions at
  entirely unrelated source locations rather than just a line away, making the
  confusion of showing the return site rather than the call site quite severe.

  Often the first frame in a backtrace ("frame zero") identifies the precise
  code location of a fault, trap, or asynchronous interrupt rather than a return
  address. At other times, even the first frame is actually a return address
  (for example, backtraces collected at the time of an object allocation and
  reported later when the allocated object is used or misused). When a system
  supports in-thread trap handling, there may also be frames after the first
  that represent a precise interrupted code location rather than a return
  address, presented as the "caller" of a trap handler function (for example,
  signal handlers in POSIX systems).

  Return address frames are identified by the ``:ra`` suffix. Precise code
  location frames are identified by the ``:pc`` suffix.

  Traditional practice has often been to collect backtraces as simple address
  lists, losing the distinction between return address code locations and
  precise code locations. Some such code applies the "subtract one" adjustment
  described above to the address values before reporting them, and it's not
  always clear or consistent whether this adjustment has been applied or not.
  These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
  ``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
  location this is. However, it's highly recommended that all emitters use the
  suffixed forms and deliver address values with no adjustments applied. When
  traditional practice has been ambiguous, the majority of cases seem to have
  been of printing addresses that are return address code locations and printing
  them without adjustment. So the symbolizing filter will usually apply the
  "subtract one byte" adjustment to an address printed without a disambiguating
  suffix. Assuming that a call instruction is longer than one byte on all
  supported machines, applying the "subtract one byte" adjustment a second time
  still results in an address somewhere in the call instruction, so a little
  sloppiness here often does little or no harm.

  Examples::

    {{{bt:0:0x12345678:pc}}}
    {{{bt:1:0xffffffff9abcdef0:ra}}}

``{{{hexdict:...}}}`` [#not_yet_implemented]_

  This element can span multiple lines. Here ``...`` is a sequence of key-value
  pairs where a single ``:`` separates each key from its value, and arbitrary
  whitespace separates the pairs. The value (right-hand side) of each pair
  either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
  digits. Each value might be a memory address or might be some other integer
  (including an integer that looks like a likely memory address but actually has
  an unrelated purpose). When the contextual information about the memory layout
  suggests that a given value could be a code location or a global variable data
  address, it might be presented as a source location or variable name or with
  active UI that makes such interpretation optionally visible.
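The key-value syntax just described can be read with a single regular
expression. This is a rough sketch under stated assumptions: ``PAIR_RE`` and
``parse_hexdict_body`` are hypothetical names, the input is the text between
``{{{hexdict:`` and ``}}}``, and whitespace is tolerated between the ``:`` and
the value.

```python
import re

# Hypothetical pattern: a key, a single ":", optional whitespace, then either
# "0x" followed by hexadecimal digits or one or more "0" digits.
PAIR_RE = re.compile(r"(\S+?):\s*(0x[0-9a-fA-F]+|0+)(?=\s|$)")


def parse_hexdict_body(body):
    """Map each key in a hexdict body to its integer value."""
    # int(text, 16) accepts both "0x..." values and all-zero values.
    return {key: int(value, 16) for key, value in PAIR_RE.findall(body)}


regs = parse_hexdict_body("CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0")
print(hex(regs["RIP"]))  # 0x6ee17076fb80
```

A filter would then test each value against the known memory layout to decide
whether it plausibly refers to code or data.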

  The intended use is for things like register dumps, where the emitter doesn't
  know which values might have a symbolic interpretation but a presentation that
  makes plausible symbolic interpretations available might be very useful to
  someone reading the log. At the same time, a flat text presentation should
  usually avoid interfering too much with the original contents and formatting
  of the dump. For example, it might use footnotes with source locations for
  values that appear to be code locations. An active UI presentation might show
  the dump text as is, but highlight values with symbolic information available
  and pop up a presentation of symbolic details when a value is selected.

  Example::

    {{{hexdict:
      CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0
      RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc
      RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8
      R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246
      R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0
    }}}

Trigger elements
================

These elements cause an external action and will be presented to the user in a
human-readable form. Generally they trigger an external action to occur that
results in a linkable page. The link or other useful information about
the external action can then be presented to the user.

``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_

  Here the first ``%s`` is an identifier for a type of dump and the second
  ``%s`` is an identifier for a particular dump that's just been published. The
  types of dumps, the exact meaning of "published", and the nature of the
  identifier are outside the scope of the markup format per se. In general it
  might correspond to writing a file by that name or something similar.

  This element may trigger additional post-processing work beyond symbolizing
  the markup. It indicates that a dump file of some sort has been published.
  Some logic attached to the symbolizing filter may understand certain types of
  dump file and trigger additional post-processing of the dump file upon
  encountering this element (e.g. generating visualizations, symbolization). The
  expectation is that the information collected from contextual elements
  (described below) in the logging stream may be necessary to decode the content
  of the dump. So if the symbolizing filter triggers other processing, it may
  need to feed some distilled form of the contextual information to those
  processes.

  An example of a type identifier is ``sancov``, for dumps from LLVM
  `SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.

  Example::

    {{{dumpfile:sancov:sancov.8675}}}

Contextual elements
===================

These are elements that supply information necessary to convert presentation
elements to symbolic form. Unlike presentation elements, they are not directly
related to the surrounding text. Contextual elements should appear alone on
lines with no other non-whitespace text, so that the symbolizing filter might
elide the whole line from its output without hiding any other log text.

The contextual elements themselves do not necessarily need to be presented in
human-readable output. However, the information they impart may be essential to
understanding the logging text even after symbolization. So it's recommended
that this information be preserved in some form when the original raw log with
markup may no longer be readily accessible for whatever reason.

Contextual elements should appear in the logging stream before they are needed.
That is, if some piece of context may affect how the symbolizing filter would
interpret or present a later presentation element, the necessary contextual
elements should have appeared somewhere earlier in the logging stream. It should
always be possible for the symbolizing filter to be implemented as a single pass
over the raw logging stream, accumulating context and massaging text as it goes.

``{{{reset}}}``

  This should be output before any other contextual element. This contextual
  element is needed to support implementations that handle logs coming from
  multiple processes. Such implementations might not know when a new process
  starts or ends. Because some identifying information (like process IDs) might
  be the same between old and new processes, a way is needed to distinguish two
  processes with such identical identifying information. This element informs
  such implementations to reset the state of a filter so that information from a
  previous process's contextual elements is not assumed for a new process that
  just happens to have the same identifying information.

``{{{module:%i:%s:%s:...}}}``

  This element represents a so-called "module". A "module" is a single linked
  binary, such as a loaded ELF file. Usually each module occupies a contiguous
  range of memory.

  Here ``%i`` is the module ID which is used by other contextual elements to
  refer to this module. The first ``%s`` is a human-readable identifier for the
  module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
  empty. It's only for casual information. Only the module ID is used to refer
  to this module in other contextual elements, never the ``%s`` string. The
  ``module`` element defining a module ID must always be emitted before any
  other elements that refer to that module ID, so that a filter never needs to
  keep track of dangling references.
  The second ``%s`` is the module type and it
  determines what the remaining fields are. The following module types are
  supported:

  * ``elf:%x``

    Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
    linked binary. The Build ID string is the sole way to identify the binary from
    which this module was loaded.

    Example::

      {{{module:1:libc.so:elf:83238ab56ba10497}}}

``{{{mmap:%p:%i:...}}}``

  This contextual element is used to give information about a particular region
  in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
  region of memory. The ``...`` part can take different forms to give different
  information about the specified region of memory. The allowed forms are the
  following:

  * ``load:%i:%s:%p``

    This subelement informs the filter that a segment was loaded from a module.
    The module is identified by its module ID ``%i``. The ``%s`` is one or more of
    the letters 'r', 'w', and 'x' (in that order and in either upper or lower
    case) to indicate this segment of memory is readable, writable, and/or
    executable. The symbolizing filter can use this information to guess whether
    an address is a likely code address or a likely data address in the given
    module. The remaining ``%p`` gives the module relative address. For ELF files
    the module relative address will be the ``p_vaddr`` of the associated program
    header. For example if your module's executable segment has
    ``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
    then you need to subtract ``0x7acba69d4000`` from any address between
    ``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
    The starting address will usually have been rounded down to the active page
    size, and the size rounded up.

    Example::

      {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}

.. rubric:: Footnotes

.. [#not_yet_implemented] This markup element is not yet implemented in
  :doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>`.
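The module-relative address arithmetic described for the ``load`` form of
``mmap`` can be sketched as follows. This is an illustration under stated
assumptions, not the logic of any existing filter: ``LoadMapping``,
``parse_int_field``, ``parse_load_mmap``, and ``module_relative`` are
hypothetical names, and only a well-formed ``load`` element is handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadMapping:
    """Fields of one mmap element that uses the load form (hypothetical type)."""
    start: int        # starting address of the region
    size: int         # size of the region in bytes
    module_id: int    # ID given by the corresponding module element
    flags: str        # some of "r", "w", "x", normalized to lower case
    module_addr: int  # module-relative address of the start of the region


def parse_int_field(text: str) -> int:
    """Parse a %i field: hex with "0x", octal with leading "0", else decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


def parse_load_mmap(element: str) -> LoadMapping:
    # Strip the "{{{" and "}}}" delimiters, then split the fields on ":".
    fields = element.strip()[3:-3].split(":")
    if len(fields) < 7 or fields[0] != "mmap" or fields[3] != "load":
        raise ValueError("not an mmap element in load form")
    return LoadMapping(
        start=int(fields[1], 16),  # %p: "0x" hex, or all zeros
        size=parse_int_field(fields[2]),
        module_id=parse_int_field(fields[4]),
        flags=fields[5].lower(),
        module_addr=int(fields[6], 16),
    )


def module_relative(mapping: LoadMapping, address: int) -> Optional[int]:
    """Translate a runtime address, or return None if it is outside the region."""
    if not mapping.start <= address < mapping.start + mapping.size:
        return None
    return address - mapping.start + mapping.module_addr


m = parse_load_mmap("{{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}")
print(hex(module_relative(m, 0x7acba69d5234)))  # 0x1234
```

With the values from the example above, ``mapping.start - mapping.module_addr``
is ``0x7acba69d4000``, the amount the text says to subtract.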