=================
DataFlowSanitizer
=================

.. toctree::
   :hidden:

   DataFlowSanitizerDesign

.. contents::
   :local:

Introduction
============

DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own.  Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  mkdir libcxx-build
  cd libcxx-build

  # An example using ninja
  cmake -GNinja -S <monorepo-root>/runtimes \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi"

  ninja cxx cxxabi

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior.  To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item.  DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values.  The ABI list file also controls
how labels are propagated in the former case.  DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux.  However, users may need to extend the ABI list in cases
where a particular library or function cannot be instrumented (e.g. because
it is implemented in assembly or another language which DataFlowSanitizer does
not support) or a function is called from a library or function which cannot
be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI.  Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown.  The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory).  Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function.  This function may wrap
  the original function or provide its own implementation.  This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior.  The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list.  If ``F``
  has a non-void return type, a final argument of type ``dfsan_label *``
  is appended, to which the custom function can store the label for the
  return value.  For example:

.. code-block:: c++

  void f(int x);
  void __dfsw_f(int x, dfsan_label x_label);

  void *memcpy(void *dest, const void *src, size_t n);
  void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                      dfsan_label dest_label, dfsan_label src_label,
                      dfsan_label n_label, dfsan_label *ret_label);
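As a minimal sketch of the ``custom`` convention, assume a hypothetical
uninstrumented function ``clamp_to_byte`` that the ABI list marks with
``fun:clamp_to_byte=uninstrumented`` and ``fun:clamp_to_byte=custom``.  The
function, its wrapper body and the ``extern "C"`` linkage are assumptions made
for illustration.  The ``dfsan_label`` typedef is a stand-in so that the sketch
compiles without the DFSan runtime; real code would include
``sanitizer/dfsan_interface.h`` instead.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for the typedef provided by sanitizer/dfsan_interface.h
  // (assumption: an 8-bit unsigned label).
  typedef std::uint8_t dfsan_label;

  extern "C" {

  // Hypothetical uninstrumented function F.
  int clamp_to_byte(int x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return x;
  }

  // Custom wrapper __dfsw_F: the arguments of F, then one label per argument,
  // then a dfsan_label* for the return value because F returns non-void.
  int __dfsw_clamp_to_byte(int x, dfsan_label x_label, dfsan_label *ret_label) {
    *ret_label = x_label;  // the result is derived only from x
    return clamp_to_byte(x);
  }

  }  // extern "C"

In this sketch the wrapper forwards the label of ``x`` to the return value,
which is one plausible policy for a function whose result depends only on its
argument.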

If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI.  Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom
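To make the ``discard`` and ``functional`` rules concrete, the following sketch
models only the return-value label that each category produces.  It is a plain
C++ model, not DFSan code: ``Category``, ``return_label`` and the
``dfsan_label`` stand-in are hypothetical names, and the union of labels is
modelled as a bitwise OR, as described in the Example section.

.. code-block:: c++

  #include <cstdint>
  #include <initializer_list>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;

  // Hypothetical enum naming two of the ABI list categories.
  enum class Category { Discard, Functional };

  // Label of the return value of an uninstrumented call in this model.
  dfsan_label return_label(Category category,
                           std::initializer_list<dfsan_label> arg_labels) {
    if (category == Category::Discard) return 0;  // return value is unlabelled
    dfsan_label result = 0;
    for (dfsan_label label : arg_labels) {
      result = static_cast<dfsan_label>(result | label);  // union of arguments
    }
    return result;
  }

For instance, with argument labels 1 and 2, the model yields 0 for ``discard``
and 3 for ``functional``.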

For instrumented functions, the ABI list supports a ``force_zero_labels``
category, which will make all stores and return values set zero labels.
Functions should never be labelled with both ``force_zero_labels``
and ``uninstrumented`` or any of the uninstrumented wrapper kinds.

For example:

.. code-block:: none

  # e.g. void writes_data(char* out_buf, int out_buf_len) {...}
  # Applying force_zero_labels will force out_buf shadow to zero.
  fun:writes_data=force_zero_labels


Compilation Flags
-----------------

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example:

.. code-block:: c++

  v = *p;

If the flag is true, the label of ``v`` is the union of the label of ``p`` and
the label of ``*p``. If the flag is false, the label of ``v`` is the label of
just ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example:

.. code-block:: c++

  *p = v;

If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
the label of ``v``. If the flag is false, the label of ``*p`` is the label of
just ``v``.

* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example:

.. code-block:: c++

  p += i;

If the flag is true, the label of ``p`` is the union of the label of ``p`` and
the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example:

.. code-block:: c++

  v = b ? v1 : v2;

If the flag is true, the label of ``v`` is the union of the labels of ``b``,
``v1`` and ``v2``.  If the flag is false, the label of ``v`` is the union of the
labels of just ``v1`` and ``v2``.
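The four propagation flags described so far can be summarised in a small
model.  This is a plain C++ sketch of the rules as stated in the text, not the
instrumentation pass itself: ``PropagationFlags``, ``union_labels`` and the
``label_after_*`` helpers are hypothetical names, ``dfsan_label`` is a
stand-in typedef, and label union is modelled as a bitwise OR.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;

  // Hypothetical mirror of the four flags, set to the documented defaults.
  struct PropagationFlags {
    bool combine_pointer_labels_on_load = true;
    bool combine_pointer_labels_on_store = false;
    bool combine_offset_labels_on_gep = true;
    bool track_select_control_flow = true;
  };

  // Label union, modelled as a bitwise OR.
  inline dfsan_label union_labels(dfsan_label a, dfsan_label b) {
    return static_cast<dfsan_label>(a | b);
  }

  // v = *p;   returns the label of v
  dfsan_label label_after_load(const PropagationFlags &flags, dfsan_label p,
                               dfsan_label pointee) {
    return flags.combine_pointer_labels_on_load ? union_labels(p, pointee)
                                                : pointee;
  }

  // *p = v;   returns the label of *p
  dfsan_label label_after_store(const PropagationFlags &flags, dfsan_label p,
                                dfsan_label v) {
    return flags.combine_pointer_labels_on_store ? union_labels(p, v) : v;
  }

  // p += i;   returns the new label of p
  dfsan_label label_after_gep(const PropagationFlags &flags, dfsan_label p,
                              dfsan_label i) {
    return flags.combine_offset_labels_on_gep ? union_labels(p, i) : p;
  }

  // v = b ? v1 : v2;   returns the label of v
  dfsan_label label_after_select(const PropagationFlags &flags, dfsan_label b,
                                 dfsan_label v1, dfsan_label v2) {
    const dfsan_label values = union_labels(v1, v2);
    return flags.track_select_control_flow ? union_labels(b, values) : values;
  }

With the default flags, a load through a pointer labelled 1 of a value labelled
2 yields 3 in this model, whereas a store of the same value yields 2.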

* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

.. code-block:: c++

  void __dfsan_load_callback(dfsan_label Label, void* Addr);
  void __dfsan_store_callback(dfsan_label Label, void* Addr);
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
  void __dfsan_cmp_callback(dfsan_label CombinedLabel);
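A minimal sketch of such definitions is shown here.  It only counts events and,
as a choice of this sketch, ignores loads, stores and comparisons whose label
is zero.  ``EventCounts`` and ``g_event_counts`` are hypothetical names, the
``extern "C"`` linkage is an assumption for C++ translation units, and the
``dfsan_label`` typedef is a stand-in so that the sketch compiles without
``sanitizer/dfsan_interface.h``.

.. code-block:: c++

  #include <cstddef>
  #include <cstdint>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;

  // Hypothetical counters, one per kind of data event.
  struct EventCounts {
    unsigned long loads = 0;
    unsigned long stores = 0;
    unsigned long transfers = 0;
    unsigned long compares = 0;
  };
  EventCounts g_event_counts;

  extern "C" {

  void __dfsan_load_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_event_counts.loads;
  }

  void __dfsan_store_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_event_counts.stores;
  }

  void __dfsan_mem_transfer_callback(dfsan_label *Start, std::size_t Len) {
    (void)Start;
    (void)Len;
    ++g_event_counts.transfers;
  }

  void __dfsan_cmp_callback(dfsan_label CombinedLabel) {
    if (CombinedLabel != 0) ++g_event_counts.compares;
  }

  }  // extern "C"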

* ``-dfsan-conditional-callbacks`` -- An experimental feature that inserts
  callbacks for control flow conditional expressions.
  This can be used to find where tainted values can control execution.

  In addition to this compilation flag, a callback handler must be registered
  using ``dfsan_set_conditional_callback(my_callback);``, where ``my_callback``
  is a function with a signature matching
  ``void my_callback(dfsan_label l, dfsan_origin o);``.
  The signature is the same when origin tracking is disabled; in that case
  the ``dfsan_origin`` passed to it is always 0.

  The callback will only be called when a tainted value reaches a conditional
  expression for control flow (such as an if's condition).
  The callback will be skipped for conditional expressions inside signal
  handlers, as this is prone to deadlock. Tainted values used in conditional
  expressions inside signal handlers will instead be aggregated via bitwise
  or, and can be accessed using
  ``dfsan_label dfsan_get_labels_in_signal_conditional();``.
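The dispatch rule for conditional callbacks can be modelled in a few lines.
This is a plain C++ sketch of the behaviour described above, not the DFSan
runtime: ``ConditionalModel``, ``on_conditional`` and ``record_conditional``
are hypothetical names, and both typedefs are stand-ins for the ones in
``sanitizer/dfsan_interface.h``.

.. code-block:: c++

  #include <cstdint>

  // Stand-ins for the typedefs in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;
  typedef std::uint32_t dfsan_origin;

  typedef void (*conditional_callback_t)(dfsan_label, dfsan_origin);

  // Hypothetical model of what happens when a value reaches a condition.
  struct ConditionalModel {
    conditional_callback_t callback = nullptr;
    dfsan_label labels_in_signal = 0;

    void on_conditional(dfsan_label label, dfsan_origin origin,
                        bool in_signal_handler) {
      if (label == 0) return;  // untainted values never trigger the callback
      if (in_signal_handler) {
        // Inside a signal handler the callback is skipped and the label is
        // aggregated via bitwise OR instead.
        labels_in_signal = static_cast<dfsan_label>(labels_in_signal | label);
        return;
      }
      if (callback != nullptr) callback(label, origin);
    }
  };

  // Example handler playing the role of my_callback: remember the last label.
  dfsan_label g_last_label = 0;
  void record_conditional(dfsan_label l, dfsan_origin o) {
    (void)o;  // always 0 when origin tracking is disabled
    g_last_label = l;
  }

In this model, labels 1 and 4 seen inside a signal handler accumulate to 5
without invoking the handler.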

* ``-dfsan-reaches-function-callbacks`` -- An experimental feature that inserts
  callbacks for data entering a function.

  In addition to this compilation flag, a callback handler must be registered
  using ``dfsan_set_reaches_function_callback(my_callback);``, where
  ``my_callback`` is a function with a signature matching
  ``void my_callback(dfsan_label label, dfsan_origin origin, const char *file, unsigned int line, const char *function);``.
  The signature is the same when origin tracking is disabled; in that case
  the ``dfsan_origin`` passed to it is always 0.

  The callback will be called when a tainted value reaches the stack or
  registers in the context of a function. Tainted values can reach a function:

  * via the arguments of the function
  * via the return value of a call that occurs in the function
  * via the loaded value of a load that occurs in the function

  The callback will be skipped inside signal handlers, as this is prone to
  deadlock. Tainted values reaching functions inside signal handlers will
  instead be aggregated via bitwise or, and can be accessed using
  ``dfsan_label dfsan_get_labels_in_signal_reaches_function()``.
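A handler with the signature given above might look like the following sketch.
``report_reaches_function`` and ``g_last_report`` are hypothetical names, the
report format is arbitrary, and both typedefs are stand-ins so that the sketch
compiles without ``sanitizer/dfsan_interface.h``.  Registration through
``dfsan_set_reaches_function_callback`` is left out because it needs the DFSan
runtime.

.. code-block:: c++

  #include <cstdint>
  #include <string>

  // Stand-ins for the typedefs in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;
  typedef std::uint32_t dfsan_origin;

  // Hypothetical sink that keeps the most recent report.
  std::string g_last_report;

  // Plays the role of my_callback: format where the tainted value was seen.
  void report_reaches_function(dfsan_label label, dfsan_origin origin,
                               const char *file, unsigned int line,
                               const char *function) {
    (void)origin;  // always 0 when origin tracking is disabled
    g_last_report = std::string(function) + " (" + file + ":" +
                    std::to_string(line) + ") label=" +
                    std::to_string(static_cast<unsigned>(label));
  }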

* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.

Environment Variables
---------------------

These runtime options are set through the ``DFSAN_OPTIONS`` environment
variable.

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions in the ``custom``
  category of the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit on the origin chain length. Non-positive
  values mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit on the reference count of an
  origin node. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero the shadow space of newly allocated
  memory. Its default value is true.
* ``zero_in_free`` -- Whether to zero the shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.

The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }
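The label arithmetic that the program relies on can also be checked without
the DFSan runtime.  The sketch below is a plain C++ model of the encoding
described at the start of this section: ``base_label``, ``union_labels`` and
``has_label`` are hypothetical helpers, and ``dfsan_label`` is a stand-in
typedef for the one in ``sanitizer/dfsan_interface.h``.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for the typedef in sanitizer/dfsan_interface.h (assumption).
  typedef std::uint8_t dfsan_label;

  // The n-th base label (n in 0..7) is the power of two with bit n set.
  constexpr dfsan_label base_label(unsigned n) {
    return static_cast<dfsan_label>(1u << n);
  }

  // A union label is the bitwise OR of its members.
  constexpr dfsan_label union_labels(dfsan_label a, dfsan_label b) {
    return static_cast<dfsan_label>(a | b);
  }

  // True if every bit of elem is also set in label.
  constexpr bool has_label(dfsan_label label, dfsan_label elem) {
    return (label & elem) == elem;
  }

  static_assert(base_label(7) == 128, "the eighth base label is 128");
  static_assert(union_labels(1, 2) == 3, "labels 1 and 2 combine to 3");
  static_assert(union_labels(3, 4) == 7, "adding label 4 gives 7");

The values 3 and 7 match the ``ij_label == 3`` and ``ijk_label == 7`` checks in
the program above.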

Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example:

.. code-block:: console

    % cat test.cc
    #include <sanitizer/dfsan_interface.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      int i = 0;
      dfsan_set_label(1, &i, sizeof(i));
      int j = i + 1;
      dfsan_print_origin_trace(&j, "A flow from i to j");
      return 0;
    }

    % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
    % ./a.out
    Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
    Origin value: 0x13900001, Taint value was stored to memory at
      #0 0x55676db85a62 in main test.cc:7:7
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

    Origin value: 0x9e00001, Taint value was created at
      #0 0x55676db85a08 in main test.cc:6:3
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores that a labeled value went through. Origin tracking slows down
program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With ``-mllvm -dfsan-track-origins=2``,
DataFlowSanitizer also collects the intermediate loads that a labeled value went
through. This mode slows down program execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86\_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.
