=================
DataFlowSanitizer
=================

.. toctree::
   :hidden:

   DataFlowSanitizerDesign

.. contents::
   :local:

Introduction
============

DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own.  Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  cd libcxx-build

  # An example using ninja
  cmake -GNinja path/to/llvm-project/llvm \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_LIBCXX=ON \
    -DLLVM_ENABLE_PROJECTS="libcxx;libcxxabi"

  ninja cxx cxxabi

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior.  To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item.  DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values.  The ABI list file also controls
how labels are propagated in the former case.  DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux.  Users may need to extend the ABI list in cases where a particular
library or function cannot be instrumented (e.g. because it is implemented in
assembly or another language which DataFlowSanitizer does not support) or a
function is called from a library or function which cannot be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI.  Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown.  The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory).  Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function.  This function may wrap
  the original function or provide its own implementation.  This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior.  The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list.  If ``F``
  is of non-void return type, a final argument of type ``dfsan_label *``
  is appended to which the custom function can store the label for the
  return value.  For example:

.. code-block:: c++

  void f(int x);
  void __dfsw_f(int x, dfsan_label x_label);

  void *memcpy(void *dest, const void *src, size_t n);
  void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                      dfsan_label dest_label, dfsan_label src_label,
                      dfsan_label n_label, dfsan_label *ret_label);

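As a rough illustration of the ``custom`` convention described above, the
following sketch shows what the body of such a wrapper might look like. The
function ``clamp_to_byte`` is hypothetical and not part of any real library,
and the local ``dfsan_label`` typedef is only a stand-in so that the sketch
compiles without the DFSan runtime; real code would include
``sanitizer/dfsan_interface.h`` instead.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for the type declared in sanitizer/dfsan_interface.h, so that
  // this sketch compiles without the DFSan runtime.
  typedef std::uint8_t dfsan_label;

  // Hypothetical uninstrumented function, assumed to be listed in the ABI
  // list as fun:clamp_to_byte=uninstrumented and fun:clamp_to_byte=custom.
  extern "C" int clamp_to_byte(int x) {
    return x < 0 ? 0 : (x > 255 ? 255 : x);
  }

  // Custom wrapper: the original argument, then one label per argument, then
  // a pointer through which the label of the return value is reported.
  extern "C" int __dfsw_clamp_to_byte(int x, dfsan_label x_label,
                                      dfsan_label *ret_label) {
    *ret_label = x_label;     // the result depends only on x
    return clamp_to_byte(x);  // delegate to the original implementation
  }

Here the wrapper simply forwards the label of ``x`` to the return value; a
function that writes to user-accessible memory would also have to update the
corresponding shadow memory.
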
If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI.  Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom

Compilation Flags
-----------------

The flags below are LLVM options; as in the origin tracking example later in
this document, they are passed to Clang with ``-mllvm``.

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example:

.. code-block:: c++

  v = *p;

If the flag is true, the label of ``v`` is the union of the label of ``p`` and
the label of ``*p``. If the flag is false, the label of ``v`` is the label of
just ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example:

.. code-block:: c++

  *p = v;

If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
the label of ``v``. If the flag is false, the label of ``*p`` is the label of
just ``v``.

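The load and store rules above can be modelled in plain C++, without the
DFSan runtime. This is only a sketch of the rules as stated here:
``label_model``, ``label_after_load`` and ``label_after_store`` are
hypothetical names, and the ``bool`` parameters stand in for the values of the
two flags.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for dfsan_label; a union of labels is modelled as a bitwise OR.
  using label_model = std::uint8_t;

  // Label of v after `v = *p;`.
  constexpr label_model label_after_load(label_model p_label,
                                         label_model pointee_label,
                                         bool combine_on_load) {
    return combine_on_load
               ? static_cast<label_model>(p_label | pointee_label)
               : pointee_label;
  }

  // Label of *p after `*p = v;`.
  constexpr label_model label_after_store(label_model p_label,
                                          label_model v_label,
                                          bool combine_on_store) {
    return combine_on_store ? static_cast<label_model>(p_label | v_label)
                            : v_label;
  }

  // Pointer label 1, data label 2.
  static_assert(label_after_load(1, 2, true) == 3, "pointer label included");
  static_assert(label_after_load(1, 2, false) == 2, "pointer label ignored");
  static_assert(label_after_store(1, 2, true) == 3, "pointer label included");
  static_assert(label_after_store(1, 2, false) == 2, "pointer label ignored");
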
* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example:

.. code-block:: c++

  p += i;

If the flag is true, the label of ``p`` is the union of the label of ``p`` and
the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example:

.. code-block:: c++

  v = b ? v1 : v2;

If the flag is true, the label of ``v`` is the union of the labels of ``b``,
``v1`` and ``v2``.  If the flag is false, the label of ``v`` is the union of the
labels of just ``v1`` and ``v2``.

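The GEP and select rules above can be written down the same way. Again this is
a plain C++ model rather than DFSan code: ``label_model``, ``label_after_gep``
and ``label_after_select`` are hypothetical names, and the ``bool`` parameters
stand in for the values of the two flags.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for dfsan_label; a union of labels is modelled as a bitwise OR.
  using label_model = std::uint8_t;

  // Label of p after `p += i;`.
  constexpr label_model label_after_gep(label_model p_label,
                                        label_model i_label,
                                        bool combine_offset_labels) {
    return combine_offset_labels
               ? static_cast<label_model>(p_label | i_label)
               : p_label;
  }

  // Label of v after `v = b ? v1 : v2;`.
  constexpr label_model label_after_select(label_model b_label,
                                           label_model v1_label,
                                           label_model v2_label,
                                           bool track_control_flow) {
    return static_cast<label_model>((track_control_flow ? b_label : 0) |
                                    v1_label | v2_label);
  }

  static_assert(label_after_gep(1, 2, true) == 3, "offset label propagated");
  static_assert(label_after_gep(1, 2, false) == 1, "label of p unchanged");
  static_assert(label_after_select(1, 2, 4, true) == 7, "b, v1 and v2");
  static_assert(label_after_select(1, 2, 4, false) == 6, "v1 and v2 only");
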
* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

.. code-block:: c++

  void __dfsan_load_callback(dfsan_label Label, void* Addr);
  void __dfsan_store_callback(dfsan_label Label, void* Addr);
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
  void __dfsan_cmp_callback(dfsan_label CombinedLabel);

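One possible set of definitions for these callbacks is sketched below; it only
counts events. The ``DfsanEventCounts`` struct and the ``g_dfsan_events``
variable are hypothetical bookkeeping, not part of DFSan, and the local
``dfsan_label`` typedef is a stand-in for the one in
``sanitizer/dfsan_interface.h`` so that the sketch compiles on its own. The
definitions use C linkage on the assumption that the instrumentation refers to
the callbacks by their unmangled names.

.. code-block:: c++

  #include <cstddef>
  #include <cstdint>

  // Stand-in for the type declared in sanitizer/dfsan_interface.h.
  typedef std::uint8_t dfsan_label;

  // Hypothetical bookkeeping; not thread-safe, for illustration only.
  struct DfsanEventCounts {
    unsigned long labeled_loads = 0;
    unsigned long labeled_stores = 0;
    unsigned long labeled_cmps = 0;
    unsigned long transfer_len_total = 0;  // sum of all Len arguments
  };
  static DfsanEventCounts g_dfsan_events;

  extern "C" {

  void __dfsan_load_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_dfsan_events.labeled_loads;
  }

  void __dfsan_store_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_dfsan_events.labeled_stores;
  }

  void __dfsan_mem_transfer_callback(dfsan_label *Start, std::size_t Len) {
    (void)Start;
    g_dfsan_events.transfer_len_total += Len;
  }

  void __dfsan_cmp_callback(dfsan_label CombinedLabel) {
    if (CombinedLabel != 0) ++g_dfsan_events.labeled_cmps;
  }

  }  // extern "C"
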
* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.

Environment Variables
---------------------

The following runtime flags are read from the ``DFSAN_OPTIONS`` environment
variable.

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions with the
  ``custom`` category in the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit on the length of an origin chain.
  Non-positive values mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit on the reference count of an
  origin node. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero the shadow space of newly allocated
  memory. Its default value is true.
* ``zero_in_free`` -- Whether to zero the shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.

The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }

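The bit arithmetic behind these assertions can be checked without the DFSan
runtime. The sketch below is a plain C++ model, not DFSan code:
``label_model``, ``base_label``, ``union_labels`` and ``has_label`` are
hypothetical names, and ``has_label`` is modelled here as a subset test on
bits. It also shows why only 8 base labels fit in an 8-bit label.

.. code-block:: c++

  #include <cstdint>

  // Stand-in for dfsan_label: an 8-bit unsigned integer.
  using label_model = std::uint8_t;

  // The n-th base label for n in [0, 8): 1, 2, 4, ..., 128.
  constexpr label_model base_label(unsigned n) {
    return static_cast<label_model>(1u << n);
  }

  // A union label is the bitwise OR of its parts.
  constexpr label_model union_labels(label_model a, label_model b) {
    return static_cast<label_model>(a | b);
  }

  // True if every bit of elem is also set in label.
  constexpr bool has_label(label_model label, label_model elem) {
    return (label & elem) == elem;
  }

  // i_label = 1, j_label = 2, k_label = 4, as in the program above.
  static_assert(union_labels(base_label(0), base_label(1)) == 3, "i | j");
  static_assert(union_labels(3, base_label(2)) == 7, "i | j | k");
  static_assert(has_label(3, 1) && has_label(3, 2), "ij has i and j");
  static_assert(!has_label(3, 4), "ij does not have k");
  static_assert(base_label(7) == 128, "the eighth and last base label");
  static_assert(base_label(8) == 0, "a ninth base label does not fit");
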
Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example:

.. code-block:: console

    % cat test.cc
    #include <sanitizer/dfsan_interface.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      int i = 0;
      dfsan_set_label(1, &i, sizeof(i));
      int j = i + 1;
      dfsan_print_origin_trace(&j, "A flow from i to j");
      return 0;
    }

    % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
    % ./a.out
    Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
    Origin value: 0x13900001, Taint value was stored to memory at
      #0 0x55676db85a62 in main test.cc:7:7
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

    Origin value: 0x9e00001, Taint value was created at
      #0 0x55676db85a08 in main test.cc:6:3
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores a labeled value went through. Origin tracking slows down
program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With
``-mllvm -dfsan-track-origins=2``, DataFlowSanitizer also collects the
intermediate loads a labeled value went through. This mode slows down program
execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86\_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.
335