=================
DataFlowSanitizer
=================

.. toctree::
   :hidden:

   DataFlowSanitizerDesign

.. contents::
   :local:

Introduction
============

DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own. Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  mkdir libcxx-build
  cd libcxx-build

  # An example using ninja
  cmake -GNinja -S <monorepo-root>/runtimes \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi"

  ninja cxx cxxabi

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior. To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item. DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values. The ABI list file also controls
how labels are propagated in the former case. DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux, but it may become necessary for users to extend the ABI list in cases
where a particular library or function cannot be instrumented (e.g. because
it is implemented in assembly or another language which DataFlowSanitizer does
not support) or a function is called from a library or function which cannot
be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI. Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown. The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory). Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function. This function may wrap
  the original function or provide its own implementation. This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior. The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list. If ``F``
  is of non-void return type, a final argument of type ``dfsan_label *``
  is appended to which the custom function can store the label for the
  return value. For example:

.. code-block:: c++

  void f(int x);
  void __dfsw_f(int x, dfsan_label x_label);

  void *memcpy(void *dest, const void *src, size_t n);
  void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                      dfsan_label dest_label, dfsan_label src_label,
                      dfsan_label n_label, dfsan_label *ret_label);

If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI. Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom

For instrumented functions, the ABI list supports a ``force_zero_labels``
category, which will make all stores and return values set zero labels.
Functions should never be labelled with both ``force_zero_labels``
and ``uninstrumented`` or any of the uninstrumented wrapper kinds.

For example:

.. code-block:: none

  # e.g. void writes_data(char* out_buf, int out_buf_len) {...}
  # Applying force_zero_labels will force out_buf shadow to zero.
  fun:writes_data=force_zero_labels


Compilation Flags
-----------------

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example:

.. code-block:: c++

  v = *p;

If the flag is true, the label of ``v`` is the union of the label of ``p`` and
the label of ``*p``. If the flag is false, the label of ``v`` is the label of
just ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example:

.. code-block:: c++

  *p = v;

If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
the label of ``v``. If the flag is false, the label of ``*p`` is the label of
just ``v``.

* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example:

.. code-block:: c++

  p += i;

If the flag is true, the label of ``p`` is the union of the label of ``p`` and
the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example:

.. code-block:: c++

  v = b ? v1 : v2;

If the flag is true, the label of ``v`` is the union of the labels of ``b``,
``v1`` and ``v2``. If the flag is false, the label of ``v`` is the union of the
labels of just ``v1`` and ``v2``.

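The effect of the four label-combination flags described above can be
summarised with a small model. The sketch below is not DFSan code and does not
require an instrumented build. It assumes that a label is an 8-bit mask and
that the union of two labels is their bitwise OR. The names ``Label``,
``Flags``, ``load_label``, ``store_label``, ``gep_label`` and ``select_label``
are hypothetical helpers invented for this illustration; they are not part of
``sanitizer/dfsan_interface.h``.

```cpp
#include <cstdint>

// Hypothetical model, not the DFSan API: a label is an 8-bit mask and the
// union of two labels is their bitwise OR.
using Label = std::uint8_t;

// One field per flag, initialised to the default documented above.
struct Flags {
  bool combine_pointer_labels_on_load = true;
  bool combine_pointer_labels_on_store = false;
  bool combine_offset_labels_on_gep = true;
  bool track_select_control_flow = true;
};

// v = *p;  returns the modelled label of v.
constexpr Label load_label(Flags f, Label p, Label pointee) {
  return f.combine_pointer_labels_on_load ? static_cast<Label>(p | pointee)
                                          : pointee;
}

// *p = v;  returns the modelled label written for *p.
constexpr Label store_label(Flags f, Label p, Label v) {
  return f.combine_pointer_labels_on_store ? static_cast<Label>(p | v) : v;
}

// p += i;  returns the modelled label of the updated p.
constexpr Label gep_label(Flags f, Label p, Label i) {
  return f.combine_offset_labels_on_gep ? static_cast<Label>(p | i) : p;
}

// v = b ? v1 : v2;  returns the modelled label of v.
constexpr Label select_label(Flags f, Label b, Label v1, Label v2) {
  return f.track_select_control_flow ? static_cast<Label>(b | v1 | v2)
                                     : static_cast<Label>(v1 | v2);
}

// Compile-time checks of the model under the default flag values.
static_assert(load_label(Flags{}, 1, 2) == 3, "pointer label is included");
static_assert(store_label(Flags{}, 1, 2) == 2, "pointer label is ignored");
static_assert(gep_label(Flags{}, 1, 4) == 5, "offset label is included");
static_assert(select_label(Flags{}, 1, 2, 4) == 7, "condition is included");
```

In this model, flipping a field of ``Flags`` corresponds to passing the
opposite value for the matching flag. The model only restates the rules given
in the text; how the instrumentation pass implements them is outside its scope.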

* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

.. code-block:: c++

  void __dfsan_load_callback(dfsan_label Label, void* Addr);
  void __dfsan_store_callback(dfsan_label Label, void* Addr);
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
  void __dfsan_cmp_callback(dfsan_label CombinedLabel);

* ``-dfsan-conditional-callbacks`` -- An experimental feature that inserts
  callbacks for control flow conditional expressions.
  This can be used to find where tainted values can control execution.

  In addition to this compilation flag, a callback handler must be registered
  using ``dfsan_set_conditional_callback(my_callback);``, where ``my_callback``
  is a function with a signature matching
  ``void my_callback(dfsan_label l, dfsan_origin o);``.
  This signature is the same when origin tracking is disabled; in that case
  the ``dfsan_origin`` passed in will always be 0.

  The callback will only be called when a tainted value reaches a conditional
  expression for control flow (such as an if's condition).
  The callback will be skipped for conditional expressions inside signal
  handlers, as this is prone to deadlock. Tainted values used in conditional
  expressions inside signal handlers will instead be aggregated via bitwise
  OR, and can be accessed using
  ``dfsan_label dfsan_get_labels_in_signal_conditional();``.

* ``-dfsan-reaches-function-callbacks`` -- An experimental feature that inserts
  callbacks for data entering a function.

  In addition to this compilation flag, a callback handler must be registered
  using ``dfsan_set_reaches_function_callback(my_callback);``, where
  ``my_callback`` is a function with a signature matching
  ``void my_callback(dfsan_label label, dfsan_origin origin, const char *file, unsigned int line, const char *function);``.
  This signature is the same when origin tracking is disabled; in that case
  the ``dfsan_origin`` passed in will always be 0.

  The callback will be called when a tainted value reaches the stack or
  registers in the context of a function. Tainted values can reach a function:

  * via the arguments of the function
  * via the return value of a call that occurs in the function
  * via the loaded value of a load that occurs in the function

  The callback will be skipped inside signal handlers, as this is prone to
  deadlock. Tainted values reaching functions inside signal handlers will
  instead be aggregated via bitwise OR, and can be accessed using
  ``dfsan_label dfsan_get_labels_in_signal_reaches_function()``.

* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.

Environment Variables
---------------------

The following runtime options are set through the ``DFSAN_OPTIONS``
environment variable.

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions in the ``custom``
  category of the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit on the length of an origin chain.
  Non-positive values mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit on the reference count of an
  origin node. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero the shadow space of newly allocated
  memory. Its default value is true.
* ``zero_in_free`` -- Whether to zero the shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.

The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }

Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example,

.. code-block:: console

  % cat test.cc
  #include <sanitizer/dfsan_interface.h>
  #include <stdio.h>

  int main(int argc, char** argv) {
    int i = 0;
    dfsan_set_label(1, &i, sizeof(i));
    int j = i + 1;
    dfsan_print_origin_trace(&j, "A flow from i to j");
    return 0;
  }

  % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
  % ./a.out
  Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
  Origin value: 0x13900001, Taint value was stored to memory at
    #0 0x55676db85a62 in main test.cc:7:7
    #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

  Origin value: 0x9e00001, Taint value was created at
    #0 0x55676db85a08 in main test.cc:6:3
    #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores that a labeled value went through. Origin tracking slows
down program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With
``-mllvm -dfsan-track-origins=2``, DataFlowSanitizer also collects the
intermediate loads that a labeled value went through. This mode slows down
program execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86\_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.