=================
DataFlowSanitizer
=================

.. toctree::
   :hidden:

   DataFlowSanitizerDesign

.. contents::
   :local:

Introduction
============

DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own.  Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  mkdir libcxx-build
  cd libcxx-build

  # An example using ninja
  cmake -GNinja -S <monorepo-root>/runtimes \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi"

  ninja cxx cxxabi

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior.  To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item.  DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values.  The ABI list file also controls
how labels are propagated in the former case.  DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux.  Users may need to extend the ABI list when a particular library or
function cannot be instrumented (e.g. because it is implemented in assembly or
another language which DataFlowSanitizer does not support) or when a function
is called from a library or function which cannot be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI.  Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown.  The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory).  Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function.  This function may wrap
  the original function or provide its own implementation.  This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior.  The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list.  If ``F``
  is of non-void return type, a final argument of type ``dfsan_label *``
  is appended, to which the custom function can store the label for the
  return value.  For example:

.. code-block:: c++

  void f(int x);
  void __dfsw_f(int x, dfsan_label x_label);

  void *memcpy(void *dest, const void *src, size_t n);
  void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                      dfsan_label dest_label, dfsan_label src_label,
                      dfsan_label n_label, dfsan_label *ret_label);

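As a minimal sketch of what the body of a ``custom`` wrapper might look like,
the following wraps a hypothetical uninstrumented function ``saturating_add``.
The function name and the propagation policy (the return label is the union of
both argument labels) are assumptions made for illustration.  The local
``dfsan_label`` alias stands in for the typedef in
``sanitizer/dfsan_interface.h`` so that the sketch compiles without the DFSan
runtime; a real wrapper would include that header instead.

.. code-block:: c++

  #include <cassert>
  #include <climits>
  #include <cstdint>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h (assumption:
  // labels are 8-bit unsigned integers, as described in the Example section).
  using dfsan_label = std::uint8_t;

  // Hypothetical uninstrumented function F.
  extern "C" int saturating_add(int a, int b) {
    long long sum = static_cast<long long>(a) + b;
    if (sum > INT_MAX) return INT_MAX;
    if (sum < INT_MIN) return INT_MIN;
    return static_cast<int>(sum);
  }

  // Custom wrapper: F's parameters, then one label per parameter, then a
  // pointer through which the label of the return value is reported.
  extern "C" int __dfsw_saturating_add(int a, int b, dfsan_label a_label,
                                       dfsan_label b_label,
                                       dfsan_label *ret_label) {
    // The result depends on both inputs, so report the union of their labels.
    *ret_label = static_cast<dfsan_label>(a_label | b_label);
    return saturating_add(a, b);
  }

Following the pattern used for ``memcpy``, the ABI list entries for this
hypothetical function would be ``fun:saturating_add=uninstrumented`` and
``fun:saturating_add=custom``.
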
If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI.  Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom

For instrumented functions, the ABI list supports a ``force_zero_labels``
category, which will make all stores and return values set zero labels.
Functions should never be labelled with both ``force_zero_labels``
and ``uninstrumented`` or any of the uninstrumented wrapper kinds.

For example:

.. code-block:: none

  # e.g. void writes_data(char* out_buf, int out_buf_len) {...}
  # Applying force_zero_labels will force out_buf shadow to zero.
  fun:writes_data=force_zero_labels


Compilation Flags
-----------------

These are LLVM options, so each one is passed to Clang through ``-mllvm``
(for example ``-mllvm -dfsan-track-origins=1``).

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example:

.. code-block:: c++

  v = *p;

If the flag is true, the label of ``v`` is the union of the label of ``p`` and
the label of ``*p``. If the flag is false, the label of ``v`` is the label of
just ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example:

.. code-block:: c++

  *p = v;

If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
the label of ``v``. If the flag is false, the label of ``*p`` is the label of
just ``v``.

* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example:

.. code-block:: c++

  p += i;

If the flag is true, the label of ``p`` is the union of the label of ``p`` and
the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example:

.. code-block:: c++

  v = b ? v1 : v2;

If the flag is true, the label of ``v`` is the union of the labels of ``b``,
``v1`` and ``v2``.  If the flag is false, the label of ``v`` is the union of the
labels of just ``v1`` and ``v2``.

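The four propagation flags described so far can be summarised in a small
model.  This is not DataFlowSanitizer code: the struct and function names are
hypothetical, labels are modelled as 8-bit masks, and the union of labels is
modelled as a bitwise OR.  The default member values mirror the documented
defaults.

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  using Label = std::uint8_t;  // model of an 8-bit label

  // Hypothetical container for the four flags, with the documented defaults.
  struct PropagationFlags {
    bool combine_pointer_labels_on_load = true;
    bool combine_pointer_labels_on_store = false;
    bool combine_offset_labels_on_gep = true;
    bool track_select_control_flow = true;
  };

  // v = *p;  returns the label of v.
  Label load_label(const PropagationFlags &f, Label p, Label pointee) {
    return f.combine_pointer_labels_on_load ? static_cast<Label>(p | pointee)
                                            : pointee;
  }

  // *p = v;  returns the label written for *p.
  Label store_label(const PropagationFlags &f, Label p, Label v) {
    return f.combine_pointer_labels_on_store ? static_cast<Label>(p | v) : v;
  }

  // p += i;  returns the new label of p.
  Label gep_label(const PropagationFlags &f, Label p, Label i) {
    return f.combine_offset_labels_on_gep ? static_cast<Label>(p | i) : p;
  }

  // v = b ? v1 : v2;  returns the label of v as described above.
  Label select_label(const PropagationFlags &f, Label b, Label v1, Label v2) {
    Label values = static_cast<Label>(v1 | v2);
    return f.track_select_control_flow ? static_cast<Label>(b | values)
                                       : values;
  }

With the defaults, a load through a pointer labelled 1 of a value labelled 2
yields label 3, while a store through the same pointer leaves the stored
value's label at 2.
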
* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

.. code-block:: c++

  void __dfsan_load_callback(dfsan_label Label, void* Addr);
  void __dfsan_store_callback(dfsan_label Label, void* Addr);
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
  void __dfsan_cmp_callback(dfsan_label CombinedLabel);

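A minimal sketch of such definitions, which simply count events, might look
like the following.  The ``EventCounts`` struct and ``g_counts`` variable are
hypothetical, the ``extern "C"`` linkage is an assumption for code compiled as
C++, and the local ``dfsan_label`` alias stands in for the typedef in
``sanitizer/dfsan_interface.h`` so that the sketch compiles on its own.

.. code-block:: c++

  #include <cassert>
  #include <cstddef>
  #include <cstdint>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h.
  using dfsan_label = std::uint8_t;

  // Hypothetical bookkeeping for this sketch.
  struct EventCounts {
    unsigned long loads = 0;
    unsigned long stores = 0;
    unsigned long transfers = 0;
    unsigned long cmps = 0;
  };
  static EventCounts g_counts;

  extern "C" {

  void __dfsan_load_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_counts.loads;  // ignore unlabelled data
  }

  void __dfsan_store_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_counts.stores;
  }

  void __dfsan_mem_transfer_callback(dfsan_label *Start, std::size_t Len) {
    (void)Start;
    (void)Len;
    ++g_counts.transfers;
  }

  void __dfsan_cmp_callback(dfsan_label CombinedLabel) {
    if (CombinedLabel != 0) ++g_counts.cmps;
  }

  }  // extern "C"
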
* ``-dfsan-conditional-callbacks`` -- An experimental feature that inserts
  callbacks for control flow conditional expressions.
  This can be used to find where tainted values can control execution.

  In addition to this compilation flag, a callback handler must be registered
  using ``dfsan_set_conditional_callback(my_callback);``, where ``my_callback``
  is a function with a signature matching
  ``void my_callback(dfsan_label l, dfsan_origin o);``.
  This signature is the same when origin tracking is disabled; in this case
  the ``dfsan_origin`` passed in will always be 0.

  The callback will only be called when a tainted value reaches a conditional
  expression for control flow (such as an if's condition).
  The callback will be skipped for conditional expressions inside signal
  handlers, as this is prone to deadlock. Tainted values used in conditional
  expressions inside signal handlers will instead be aggregated via bitwise
  or, and can be accessed using
  ``dfsan_label dfsan_get_labels_in_signal_conditional();``.

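As an illustration of the handler shape, the sketch below accumulates every
label that reaches a conditional.  The accumulator ``g_seen_labels`` is
hypothetical, and the two type aliases stand in for the typedefs in
``sanitizer/dfsan_interface.h`` so that the sketch compiles without the DFSan
runtime.  In an instrumented program the handler would be registered with
``dfsan_set_conditional_callback(my_callback);`` as described above.

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // Stand-ins for the typedefs in sanitizer/dfsan_interface.h.
  using dfsan_label = std::uint8_t;
  using dfsan_origin = std::uint32_t;

  // Hypothetical accumulator: union of all labels seen in conditionals.
  static dfsan_label g_seen_labels = 0;

  // Handler with the documented signature.
  void my_callback(dfsan_label l, dfsan_origin o) {
    (void)o;  // always 0 when origin tracking is disabled
    g_seen_labels |= l;
  }
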
* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.

Environment Variables
---------------------

Runtime options are passed through the ``DFSAN_OPTIONS`` environment variable,
for example ``DFSAN_OPTIONS=warn_unimplemented=1``.

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions in the ``custom``
  category of the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit on the origin chain length. Non-positive
  values mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit on the reference count of an
  origin node. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero the shadow space of newly allocated
  memory. Its default value is true.
* ``zero_in_free`` -- Whether to zero the shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.

The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }

Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example:

.. code-block:: console

    % cat test.cc
    #include <sanitizer/dfsan_interface.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      int i = 0;
      dfsan_set_label(1, &i, sizeof(i));
      int j = i + 1;
      dfsan_print_origin_trace(&j, "A flow from i to j");
      return 0;
    }

    % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
    % ./a.out
    Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
    Origin value: 0x13900001, Taint value was stored to memory at
      #0 0x55676db85a62 in main test.cc:7:7
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

    Origin value: 0x9e00001, Taint value was created at
      #0 0x55676db85a08 in main test.cc:6:3
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores a labeled value went through. Origin tracking slows down
program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With
``-mllvm -dfsan-track-origins=2``, DataFlowSanitizer also collects the
intermediate loads a labeled value went through. This mode slows down program
execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86\_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.

368