==========================
Symbolizer Markup Format
==========================

.. contents::
   :local:

Overview
========

This document defines a text format for log messages that can be processed by a
symbolizing filter. The basic idea is that logging code emits text that contains
raw address values and so forth, without the logging code doing any real work to
convert those values to human-readable form. Instead, logging text uses the
markup format defined here to identify pieces of information that should be
converted to human-readable form after the fact. As with other markup formats,
the expectation is that most of the text will be displayed as is, while the
markup elements will be replaced with expanded text, or converted into active UI
elements, that present more details in symbolic form.

This means there is no need for symbol tables, DWARF debugging sections, or
similar information to be directly accessible at runtime. There is also no need
at runtime for any logic intended to compute human-readable presentation of
information, such as C++ symbol demangling. Instead, logging must include markup
elements that give the contextual information necessary to make sense of the raw
data, such as memory layout details.

This format identifies markup elements with a syntax that is both simple and
distinctive. It's simple enough to be matched and parsed with straightforward
code. It's distinctive enough that character sequences that look like the start
or end of a markup element should rarely if ever appear incidentally in logging
text. It's specifically intended not to require sanitizing plain text, such as
the HTML/XML requirement to replace ``<`` with ``&lt;`` and the like.

:doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>` includes a symbolizing
filter via its ``--filter-markup`` option. Also, LLVM utilities emit stack
traces as markup when the ``LLVM_ENABLE_SYMBOLIZER_MARKUP`` environment
variable is set.

Scope and assumptions
=====================

A symbolizing filter implementation will be independent both of the target
operating system and machine architecture where the logs are generated and of
the host operating system and machine architecture where the filter runs.

This format assumes that the symbolizing filter processes intact whole lines. If
long lines might be split during some stage of a logging pipeline, they must be
reassembled to restore the original line breaks before feeding lines into the
symbolizing filter. Most markup elements must appear entirely on a single line
(often with other text before and/or after the markup element). There are some
markup elements that are specified to span lines, with line breaks in the middle
of the element. Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside certain
fields.

This format assumes that the symbolizing filter processes a coherent stream of
log lines from a single process address space context. If a logging stream
interleaves log lines from more than one process, these must be collated into
separate per-process log streams and each stream processed by a separate
instance of the symbolizing filter. Because the kernel and user processes use
disjoint address regions in most operating systems, a single user process
address space plus the kernel address space can be treated as a single address
space for symbolization purposes if desired.

Dependence on Build IDs
=======================

The symbolizer markup scheme relies on contextual information about runtime
memory address layout to make it possible to convert markup elements into useful
symbolic form. This relies on having an unmistakable identification of which
binary was loaded at each address.

An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
(executable, shared library, loadable module, or driver module). The linker
generates this automatically based on a hash that includes the complete symbol
table and debugging information, even if this is later stripped from the binary.

This specification uses the ELF Build ID as the sole means of identifying
binaries. Each binary relevant to the log must have been linked with a unique
Build ID. The symbolizing filter must have some means of mapping a Build ID back
to the original ELF binary (either the whole unstripped binary, or a stripped
binary paired with a separate debug file).

Colorization
============

The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
Rendition) control sequences. These are unlike other markup elements:

* They specify presentation details (bold or colors) rather than semantic
  information. The association of semantic meaning with color (e.g. red for
  errors) is chosen by the code doing the logging, rather than by the UI
  presentation of the symbolizing filter. This is a concession to existing code
  (e.g. LLVM sanitizer runtimes) that uses specific colors and would require
  substantial changes to generate semantic markup instead.

* A single control sequence changes "the state", rather than being a
  hierarchical structure that surrounds affected text.

The filter processes ANSI SGR control sequences only within a single line. If a
control sequence to enter a bold or color state is encountered, it's expected
that the control sequence to reset to default state will be encountered before
the end of that line. If a "dangling" state is left at the end of a line, the
filter may reset to default state for the next line.

An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and the
color/bold state is expected to apply to the symbolic output that replaces the
markup element in the filter's output.

The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
using C string syntax), where ``%u`` is one of these:

==== ============================ ===============================================
Code Effect                       Notes
==== ============================ ===============================================
0    Reset to default formatting.
1    Bold text                    Combines with color states, doesn't reset them.
30   Black foreground
31   Red foreground
32   Green foreground
33   Yellow foreground
34   Blue foreground
35   Magenta foreground
36   Cyan foreground
37   White foreground
==== ============================ ===============================================
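
As an illustration of the per-line state model described above, the following
is a minimal Python sketch that splits one line into runs of text tagged with
the bold and color state in effect. The names ``split_sgr``, ``SGR_RE`` and
``COLORS`` are hypothetical and not part of any LLVM tool. The sketch assumes
the line has no other markup elements in it, so it does not implement the rule
that SGR sequences are left uninterpreted inside a markup element.

```python
import re

# Only the codes from the table are accepted: 0 (reset), 1 (bold), 30-37.
SGR_RE = re.compile(r"\x1b\[(0|1|3[0-7])m")
COLORS = ["black", "red", "green", "yellow", "blue", "magenta", "cyan", "white"]


def split_sgr(line):
    """Split one line into (text, bold, color) runs."""
    runs = []
    bold, color = False, None  # every line starts from the default state
    pos = 0
    for match in SGR_RE.finditer(line):
        if match.start() > pos:
            runs.append((line[pos:match.start()], bold, color))
        code = int(match.group(1))
        if code == 0:
            bold, color = False, None
        elif code == 1:
            bold = True  # combines with the current color, does not reset it
        else:
            color = COLORS[code - 30]
        pos = match.end()
    if pos < len(line):
        runs.append((line[pos:], bold, color))
    return runs
```

Because the state is local to the call, a "dangling" bold or color state at the
end of one line is dropped before the next line is processed, which is one of
the behaviors the format permits.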

Common markup element syntax
============================

All the markup elements share a common syntactic structure to facilitate simple
matching and parsing code. Each element has the form::

  {{{tag:fields}}}

``tag`` identifies one of the element types described below, and is always a
short alphabetic string that must be in lower case. The rest of the element
consists of one or more fields. Fields are separated by ``:`` and cannot contain
any ``:`` or ``}`` characters. How many fields must be or may be present and
what they contain is specified for each element type.

No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a field.

Implementations must ignore markup fields after those expected; this allows
adding new fields to backwards-compatibly extend elements. Implementations need
not ignore them silently, but the element should behave otherwise as if the
fields were removed.
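
The common syntax lends itself to a single regular expression. The sketch below
is one possible way to match single-line elements in Python; ``ELEMENT_RE`` and
``find_elements`` are illustrative names, and the pattern also accepts a bare
tag with no fields so that an element such as ``{{{reset}}}`` is recognized.

```python
import re

# tag: lower-case alphabetic; each field is introduced by ':' and may not
# contain ':' or '}'.
ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


def find_elements(line):
    """Return a (tag, fields) pair for every markup element on one line."""
    elements = []
    for match in ELEMENT_RE.finditer(line):
        tag, rest = match.group(1), match.group(2)
        # rest is "" or ":f1:f2..."; drop the leading ':' before splitting.
        fields = rest[1:].split(":") if rest else []
        elements.append((tag, fields))
    return elements
```

A caller that expects ``n`` fields for a given tag can simply use
``fields[:n]``, which gives the required behavior of treating any trailing
fields as if they had been removed.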

In the descriptions of each element type, ``printf``-style placeholders indicate
field contents:

``%s``
  A string of printable characters, not including ``:`` or ``}``.

``%p``
  An address value represented by ``0x`` followed by an even number of
  hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
  If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
  than 16 hexadecimal digits are expected to appear in a single value (64 bits).

``%u``
  A nonnegative decimal integer.

``%i``
  A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
  if prefixed by ``0``, or decimal otherwise.

``%x``
  A sequence of an even number of hexadecimal digits (using either lower-case or
  upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
  arbitrary sequence of bytes, such as an ELF Build ID.
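
The placeholder rules can be read as small validating parsers. The following
Python sketch follows the descriptions above literally; the function names are
hypothetical, and rejecting ``%p`` values longer than 16 digits is a choice made
for this sketch (the text only says longer values are not expected).

```python
import re


def parse_p(text):
    """%p: '0x' plus an even number of hex digits, or a run of zeros."""
    if re.fullmatch(r"0+", text):
        return 0
    if re.fullmatch(r"0x(?:[0-9a-fA-F]{2}){1,8}", text):
        return int(text[2:], 16)
    raise ValueError(f"not a %p value: {text!r}")


def parse_u(text):
    """%u: nonnegative decimal integer."""
    if re.fullmatch(r"[0-9]+", text):
        return int(text, 10)
    raise ValueError(f"not a %u value: {text!r}")


def parse_i(text):
    """%i: hexadecimal after '0x', octal after '0', decimal otherwise."""
    if re.fullmatch(r"0x[0-9a-fA-F]+", text):
        return int(text[2:], 16)
    if re.fullmatch(r"0[0-7]*", text):
        return int(text, 8)
    if re.fullmatch(r"[1-9][0-9]*", text):
        return int(text, 10)
    raise ValueError(f"not a %i value: {text!r}")


def parse_x(text):
    """%x: even number of hex digits with no prefix, returned as bytes."""
    if re.fullmatch(r"(?:[0-9a-fA-F]{2})+", text):
        return bytes.fromhex(text)
    raise ValueError(f"not a %x value: {text!r}")
```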

Presentation elements
=====================

These are elements that convey a specific program entity to be displayed in
human-readable symbolic form.

``{{{symbol:%s}}}``
  Here ``%s`` is the linkage name for a symbol or type. It may require
  demangling according to language ABI rules. Even for unmangled names, it's
  recommended that this markup element be used to identify a symbol name so that
  it can be presented distinctively.

  Examples::

    {{{symbol:_ZN7Mangled4NameEv}}}
    {{{symbol:foobar}}}

``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}``

  Here ``%p`` is the memory address of a code location. It might be presented as a
  function name and source location. The latter two forms distinguish the kind of
  code location, as described in detail for ``bt`` elements below.

  Examples::

    {{{pc:0x12345678}}}
    {{{pc:0xffffffff9abcdef0}}}

``{{{data:%p}}}``

  Here ``%p`` is the memory address of a data location. It might be presented as
  the name of a global variable at that location.

  Examples::

    {{{data:0x12345678}}}
    {{{data:0xffffffff9abcdef0}}}

``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}``

  This represents one frame in a backtrace. It usually appears on a line by
  itself (surrounded only by whitespace), in a sequence of such lines with
  ascending frame numbers. The human-readable output might therefore be
  formatted on that assumption, so that it looks good for a sequence of ``bt``
  elements, each alone on its line with uniform indentation. But it can appear
  anywhere, so the filter should not remove any non-whitespace text surrounding
  the element.

  Here ``%u`` is the frame number, which starts at zero for the location of the
  fault being identified, increments to one for the caller of frame zero's call
  frame, to two for the caller of frame one, etc. ``%p`` is the memory address
  of a code location.

  Code locations in a backtrace come from two distinct sources. Most backtrace
  frames describe a return address code location, i.e. the instruction
  immediately after a call instruction. This is the location of code that has
  yet to run, since the function called there has not yet returned. Hence the
  code location of actual interest is usually the call site itself rather than
  the return address, i.e. one instruction earlier. When presenting the source
  location for a return address frame, the symbolizing filter will subtract one
  byte or one instruction length from the actual return address for the call
  site, with the intent that the address logged can be translated directly to a
  source location for the call site and not for the apparent return site
  thereafter (which can be confusing). When inlined functions are involved, the
  call site and the return site can appear to be in different functions at
  entirely unrelated source locations rather than just a line away, making the
  confusion of showing the return site rather than the call site quite severe.

  Often the first frame in a backtrace ("frame zero") identifies the precise
  code location of a fault, trap, or asynchronous interrupt rather than a return
  address. At other times, even the first frame is actually a return address
  (for example, backtraces collected at the time of an object allocation and
  reported later when the allocated object is used or misused). When a system
  supports in-thread trap handling, there may also be frames after the first
  that represent a precise interrupted code location rather than a return
  address, presented as the "caller" of a trap handler function (for example,
  signal handlers in POSIX systems).

  Return address frames are identified by the ``:ra`` suffix. Precise code
  location frames are identified by the ``:pc`` suffix.

  Traditional practice has often been to collect backtraces as simple address
  lists, losing the distinction between return address code locations and
  precise code locations. Some such code applies the "subtract one" adjustment
  described above to the address values before reporting them, and it's not
  always clear or consistent whether this adjustment has been applied or not.
  These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
  ``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
  location this is. However, it's highly recommended that all emitters use the
  suffixed forms and deliver address values with no adjustments applied. When
  traditional practice has been ambiguous, the majority of cases seem to have
  been of printing addresses that are return address code locations and printing
  them without adjustment. So the symbolizing filter will usually apply the
  "subtract one byte" adjustment to an address printed without a disambiguating
  suffix. Assuming that a call instruction is longer than one byte on all
  supported machines, applying the "subtract one byte" adjustment a second time
  still results in an address somewhere in the call instruction, so a little
  sloppiness here often does little or no harm.

  Examples::

    {{{bt:0:0x12345678:pc}}}
    {{{bt:1:0xffffffff9abcdef0:ra}}}
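
The suffix rules for ``bt`` and ``pc`` elements reduce to a small decision when
choosing the address to look up. The Python sketch below shows one reading of
them; ``lookup_address`` is a hypothetical helper, it always uses the one-byte
form of the adjustment, and leaving an address of zero unadjusted is an
assumption of this sketch rather than something the format specifies.

```python
def lookup_address(address, kind=None):
    """Return the address to use for source lookup of a bt or pc element.

    kind is the optional suffix field: "ra", "pc", or None when absent.
    """
    if kind == "pc":
        # Precise code location: use the address exactly as logged.
        return address
    # "ra" frames, and by default the ambiguous unsuffixed form, point just
    # past a call instruction. Stepping back one byte lands inside the call.
    if address == 0:
        return address  # assumption: never wrap below zero
    return address - 1
```

For the two examples above, frame 0 is looked up at ``0x12345678`` unchanged,
while frame 1 is looked up at ``0xffffffff9abcdeef``.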

``{{{hexdict:...}}}`` [#not_yet_implemented]_

  This element can span multiple lines. Here ``...`` is a sequence of key-value
  pairs where a single ``:`` separates each key from its value, and arbitrary
  whitespace separates the pairs. The value (right-hand side) of each pair
  either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
  digits. Each value might be a memory address or might be some other integer
  (including an integer that looks like a likely memory address but actually has
  an unrelated purpose). When the contextual information about the memory layout
  suggests that a given value could be a code location or a global variable data
  address, it might be presented as a source location or variable name or with
  active UI that makes such interpretation optionally visible.

  The intended use is for things like register dumps, where the emitter doesn't
  know which values might have a symbolic interpretation but a presentation that
  makes plausible symbolic interpretations available might be very useful to
  someone reading the log. At the same time, a flat text presentation should
  usually avoid interfering too much with the original contents and formatting
  of the dump. For example, it might use footnotes with source locations for
  values that appear to be code locations. An active UI presentation might show
  the dump text as is, but highlight values with symbolic information available
  and pop up a presentation of symbolic details when a value is selected.

  Example::

    {{{hexdict:
        CS:                   0 RIP:     0x6ee17076fb80 EFL:            0x10246 CR2:                  0
        RAX:      0xc53d0acbcf0 RBX:     0x1e659ea7e0d0 RCX:                  0 RDX:     0x6ee1708300cc
        RSI:                  0 RDI:     0x6ee170830040 RBP:     0x3b13734898e0 RSP:     0x3b13734898d8
        R8:      0x3b1373489860 R9:          0x2776ff4f R10:     0x2749d3e9a940 R11:              0x246
        R12:     0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14:     0x1e659ea7e108 R15:      0xc53d0acbcf0
      }}}
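
Since this element is not yet implemented in ``llvm-symbolizer``, the following
Python sketch is only one plausible reading of the key-value syntax. It assumes
the caller has already extracted the text between ``{{{hexdict:`` and ``}}}``,
and, following the example above, it tolerates whitespace between the ``:`` and
the value. ``PAIR_RE`` and ``parse_hexdict`` are hypothetical names.

```python
import re

# key, ':', optional padding, then either 0x<hex digits> or a run of zeros.
PAIR_RE = re.compile(r"([^\s:]+):\s*(0x[0-9a-fA-F]+|0+)(?!\S)")


def parse_hexdict(body):
    """Parse the body of a hexdict element into (key, value) pairs."""
    return [(key, int(value, 16)) for key, value in PAIR_RE.findall(body)]
```

Each resulting value can then be checked against the known memory layout to
decide whether it is worth presenting symbolically.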

Trigger elements
================

These elements cause an external action and will be presented to the user in a
human-readable form. Generally they trigger an external action that results in
a linkable page. The link, or other information about the external action, can
then be presented to the user.

``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_

  Here the first ``%s`` is an identifier for a type of dump and the second
  ``%s`` is an identifier for a particular dump that's just been published. The
  types of dumps, the exact meaning of "published", and the nature of the
  identifier are outside the scope of the markup format per se. In general it
  might correspond to writing a file by that name or something similar.

  This element may trigger additional post-processing work beyond symbolizing
  the markup. It indicates that a dump file of some sort has been published.
  Some logic attached to the symbolizing filter may understand certain types of
  dump file and trigger additional post-processing of the dump file upon
  encountering this element (e.g. generating visualizations, symbolization). The
  expectation is that the information collected from contextual elements
  (described below) in the logging stream may be necessary to decode the content
  of the dump. So if the symbolizing filter triggers other processing, it may
  need to feed some distilled form of the contextual information to those
  processes.

  An example of a type identifier is ``sancov``, for dumps from LLVM
  `SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.

  Example::

    {{{dumpfile:sancov:sancov.8675}}}

Contextual elements
===================

These are elements that supply information necessary to convert presentation
elements to symbolic form. Unlike presentation elements, they are not directly
related to the surrounding text. Contextual elements should appear alone on
lines with no other non-whitespace text, so that the symbolizing filter might
elide the whole line from its output without hiding any other log text.

The contextual elements themselves do not necessarily need to be presented in
human-readable output. However, the information they impart may be essential to
understanding the logging text even after symbolization. So it's recommended
that this information be preserved in some form when the original raw log with
markup may no longer be readily accessible for whatever reason.

Contextual elements should appear in the logging stream before they are needed.
That is, if some piece of context may affect how the symbolizing filter would
interpret or present a later presentation element, the necessary contextual
elements should have appeared somewhere earlier in the logging stream. It should
always be possible for the symbolizing filter to be implemented as a single pass
over the raw logging stream, accumulating context and massaging text as it goes.

``{{{reset}}}``

  This should be output before any other contextual element. This contextual
  element exists to support implementations that handle logs coming from
  multiple processes. Such implementations might not know when a new process
  starts or ends. Because some identifying information (like process IDs) might
  be the same between old and new processes, a way is needed to distinguish two
  processes with such identical identifying information. This element informs
  such implementations to reset the state of a filter so that information from a
  previous process's contextual elements is not assumed for a new process that
  just happens to have the same identifying information.

``{{{module:%i:%s:%s:...}}}``

  This element represents a so-called "module". A "module" is a single linked
  binary, such as a loaded ELF file. Usually each module occupies a contiguous
  range of memory.

  Here ``%i`` is the module ID which is used by other contextual elements to
  refer to this module. The first ``%s`` is a human-readable identifier for the
  module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
  empty. It's only for casual information. Only the module ID is used to refer
  to this module in other contextual elements, never the ``%s`` string. The
  ``module`` element defining a module ID must always be emitted before any
  other elements that refer to that module ID, so that a filter never needs to
  keep track of dangling references. The second ``%s`` is the module type and it
  determines what the remaining fields are. The following module types are
  supported:

  * ``elf:%x``

  Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
  linked binary. The Build ID string is the sole way to identify the binary from
  which this module was loaded.

  Example::

    {{{module:1:libc.so:elf:83238ab56ba10497}}}
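
A single-pass filter needs to remember modules by ID and forget them on
``reset``. The Python sketch below shows one way to keep that state; the
``ModuleTable`` class, its ``handle`` method and the dictionary layout are all
hypothetical, and the input is assumed to be a tag plus a list of already-split
fields.

```python
def parse_module_id(text):
    """%i: hexadecimal after '0x', octal after '0', decimal otherwise."""
    if text.startswith("0x"):
        return int(text[2:], 16)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)
    return int(text, 10)


class ModuleTable:
    """Accumulates module context while scanning a log in one pass."""

    def __init__(self):
        self.modules = {}  # module ID -> {"name": str, "build_id": bytes}

    def handle(self, tag, fields):
        if tag == "reset":
            # Context from a previous process must not leak into a new one.
            self.modules.clear()
        elif tag == "module" and len(fields) >= 4 and fields[2] == "elf":
            # Fields beyond the Build ID, if any, are ignored.
            self.modules[parse_module_id(fields[0])] = {
                "name": fields[1],  # informational only, may be empty
                "build_id": bytes.fromhex(fields[3]),
            }
```

Later elements refer to a module only through its ID, so the name is stored
purely for display.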

``{{{mmap:%p:%i:...}}}``

  This contextual element is used to give information about a particular region
  in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
  region of memory. The ``...`` part can take different forms to give different
  information about the specified region of memory. The allowed forms are the
  following:

  * ``load:%i:%s:%p``

  This subelement informs the filter that a segment was loaded from a module.
  The module is identified by its module ID ``%i``. The ``%s`` is one or more of
  the letters 'r', 'w', and 'x' (in that order and in either upper or lower
  case) to indicate this segment of memory is readable, writable, and/or
  executable. The symbolizing filter can use this information to guess whether
  an address is a likely code address or a likely data address in the given
  module. The remaining ``%p`` gives the module relative address. For ELF files
  the module relative address will be the ``p_vaddr`` of the associated program
  header. For example if your module's executable segment has
  ``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
  then you need to subtract ``0x7acba69d4000`` from any address between
  ``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
  The starting address will usually have been rounded down to the active page
  size, and the size rounded up.

  Example::

    {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
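
The address arithmetic in the ``load`` description can be written out as a
short Python sketch. The function name ``module_relative`` and its parameter
names are illustrative; the values in the usage note are taken from the example
element above.

```python
def module_relative(address, start, size, module_vaddr):
    """Translate a runtime address to a module relative address.

    start and size come from the mmap element, module_vaddr from the final
    field of its load form. Returns None if the address is outside the mapping.
    """
    if not start <= address < start + size:
        return None
    # Same as subtracting the load bias (start - module_vaddr).
    return address - start + module_vaddr
```

With ``start=0x7acba69d5000``, ``size=0x5a000`` and ``module_vaddr=0x1000``,
the runtime address ``0x7acba69d5234`` maps to ``0x1234``, which matches
subtracting ``0x7acba69d4000`` as described above.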

.. rubric:: Footnotes

.. [#not_yet_implemented] This markup element is not yet implemented in
  :doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>`.
