xref: /minix3/external/bsd/llvm/dist/clang/docs/DataFlowSanitizerDesign.rst (revision f4a2713ac843a11c696ec80c0a5e3e5d80b4d338)
1*f4a2713aSLionel SambucDataFlowSanitizer Design Document
2*f4a2713aSLionel Sambuc=================================
3*f4a2713aSLionel Sambuc
4*f4a2713aSLionel SambucThis document sets out the design for DataFlowSanitizer, a general
5*f4a2713aSLionel Sambucdynamic data flow analysis.  Unlike other Sanitizer tools, this tool is
6*f4a2713aSLionel Sambucnot designed to detect a specific class of bugs on its own. Instead,
7*f4a2713aSLionel Sambucit provides a generic dynamic data flow analysis framework to be used
8*f4a2713aSLionel Sambucby clients to help detect application-specific issues within their
9*f4a2713aSLionel Sambucown code.
10*f4a2713aSLionel Sambuc
11*f4a2713aSLionel SambucDataFlowSanitizer is a program instrumentation which can associate
12*f4a2713aSLionel Sambuca number of taint labels with any data stored in any memory region
13*f4a2713aSLionel Sambucaccessible by the program. The analysis is dynamic, which means that
14*f4a2713aSLionel Sambucit operates on a running program, and tracks how the labels propagate
15*f4a2713aSLionel Sambucthrough that program. The tool shall support a large (>100) number
16*f4a2713aSLionel Sambucof labels, such that programs which operate on large numbers of data
17*f4a2713aSLionel Sambucitems may be analysed with each data item being tracked separately.
18*f4a2713aSLionel Sambuc
19*f4a2713aSLionel SambucUse Cases
20*f4a2713aSLionel Sambuc---------
21*f4a2713aSLionel Sambuc
22*f4a2713aSLionel SambucThis instrumentation can be used as a tool to help monitor how data
23*f4a2713aSLionel Sambucflows from a program's inputs (sources) to its outputs (sinks).
24*f4a2713aSLionel SambucThis has applications from a privacy/security perspective in that
25*f4a2713aSLionel Sambucone can audit how a sensitive data item is used within a program and
26*f4a2713aSLionel Sambucensure it isn't exiting the program anywhere it shouldn't be.
27*f4a2713aSLionel Sambuc
28*f4a2713aSLionel SambucInterface
29*f4a2713aSLionel Sambuc---------
30*f4a2713aSLionel Sambuc
31*f4a2713aSLionel SambucA number of functions are provided which will create taint labels,
32*f4a2713aSLionel Sambucattach labels to memory regions and extract the set of labels
33*f4a2713aSLionel Sambucassociated with a specific memory region. These functions are declared
34*f4a2713aSLionel Sambucin the header file ``sanitizer/dfsan_interface.h``.
35*f4a2713aSLionel Sambuc
36*f4a2713aSLionel Sambuc.. code-block:: c
37*f4a2713aSLionel Sambuc
38*f4a2713aSLionel Sambuc  /// Creates and returns a base label with the given description and user data.
39*f4a2713aSLionel Sambuc  dfsan_label dfsan_create_label(const char *desc, void *userdata);
40*f4a2713aSLionel Sambuc
41*f4a2713aSLionel Sambuc  /// Sets the label for each address in [addr,addr+size) to \c label.
42*f4a2713aSLionel Sambuc  void dfsan_set_label(dfsan_label label, void *addr, size_t size);
43*f4a2713aSLionel Sambuc
44*f4a2713aSLionel Sambuc  /// Sets the label for each address in [addr,addr+size) to the union of the
45*f4a2713aSLionel Sambuc  /// current label for that address and \c label.
46*f4a2713aSLionel Sambuc  void dfsan_add_label(dfsan_label label, void *addr, size_t size);
47*f4a2713aSLionel Sambuc
48*f4a2713aSLionel Sambuc  /// Retrieves the label associated with the given data.
49*f4a2713aSLionel Sambuc  ///
50*f4a2713aSLionel Sambuc  /// The type of 'data' is arbitrary.  The function accepts a value of any type,
51*f4a2713aSLionel Sambuc  /// which can be truncated or extended (implicitly or explicitly) as necessary.
52*f4a2713aSLionel Sambuc  /// The truncation/extension operations will preserve the label of the original
53*f4a2713aSLionel Sambuc  /// value.
54*f4a2713aSLionel Sambuc  dfsan_label dfsan_get_label(long data);
55*f4a2713aSLionel Sambuc
56*f4a2713aSLionel Sambuc  /// Retrieves a pointer to the dfsan_label_info struct for the given label.
57*f4a2713aSLionel Sambuc  const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
58*f4a2713aSLionel Sambuc
59*f4a2713aSLionel Sambuc  /// Returns whether the given label label contains the label elem.
60*f4a2713aSLionel Sambuc  int dfsan_has_label(dfsan_label label, dfsan_label elem);
61*f4a2713aSLionel Sambuc
62*f4a2713aSLionel Sambuc  /// If the given label label contains a label with the description desc, returns
63*f4a2713aSLionel Sambuc  /// that label, else returns 0.
64*f4a2713aSLionel Sambuc  dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
65*f4a2713aSLionel Sambuc
66*f4a2713aSLionel SambucTaint label representation
67*f4a2713aSLionel Sambuc--------------------------
68*f4a2713aSLionel Sambuc
69*f4a2713aSLionel SambucAs stated above, the tool must track a large number of taint
70*f4a2713aSLionel Sambuclabels. This poses an implementation challenge, as most multiple-label
71*f4a2713aSLionel Sambuctainting systems assign one label per bit to shadow storage, and
72*f4a2713aSLionel Sambucunion taint labels using a bitwise or operation. This will not scale
73*f4a2713aSLionel Sambucto clients which use hundreds or thousands of taint labels, as the
74*f4a2713aSLionel Sambuclabel union operation becomes O(n) in the number of supported labels,
75*f4a2713aSLionel Sambucand data associated with it will quickly dominate the live variable
76*f4a2713aSLionel Sambucset, causing register spills and hampering performance.
77*f4a2713aSLionel Sambuc
78*f4a2713aSLionel SambucInstead, a low overhead approach is proposed which is best-case O(log\
79*f4a2713aSLionel Sambuc:sub:`2` n) during execution. The underlying assumption is that
80*f4a2713aSLionel Sambucthe required space of label unions is sparse, which is a reasonable
81*f4a2713aSLionel Sambucassumption to make given that we are optimizing for the case where
82*f4a2713aSLionel Sambucapplications mostly copy data from one place to another, without often
83*f4a2713aSLionel Sambucinvoking the need for an actual union operation. The representation
84*f4a2713aSLionel Sambucof a taint label is a 16-bit integer, and new labels are allocated
85*f4a2713aSLionel Sambucsequentially from a pool. The label identifier 0 is special, and means
86*f4a2713aSLionel Sambucthat the data item is unlabelled.
87*f4a2713aSLionel Sambuc
88*f4a2713aSLionel SambucWhen a label union operation is requested at a join point (any
89*f4a2713aSLionel Sambucarithmetic or logical operation with two or more operands, such as
90*f4a2713aSLionel Sambucaddition), the code checks whether a union is required, whether the
91*f4a2713aSLionel Sambucsame union has been requested before, and whether one union label
92*f4a2713aSLionel Sambucsubsumes the other. If so, it returns the previously allocated union
93*f4a2713aSLionel Sambuclabel. If not, it allocates a new union label from the same pool used
94*f4a2713aSLionel Sambucfor new labels.
95*f4a2713aSLionel Sambuc
96*f4a2713aSLionel SambucSpecifically, the instrumentation pass will insert code like this
97*f4a2713aSLionel Sambucto decide the union label ``lu`` for a pair of labels ``l1``
98*f4a2713aSLionel Sambucand ``l2``:
99*f4a2713aSLionel Sambuc
100*f4a2713aSLionel Sambuc.. code-block:: c
101*f4a2713aSLionel Sambuc
102*f4a2713aSLionel Sambuc  if (l1 == l2)
103*f4a2713aSLionel Sambuc    lu = l1;
104*f4a2713aSLionel Sambuc  else
105*f4a2713aSLionel Sambuc    lu = __dfsan_union(l1, l2);
106*f4a2713aSLionel Sambuc
107*f4a2713aSLionel SambucThe equality comparison is outlined, to provide an early exit in
108*f4a2713aSLionel Sambucthe common cases where the program is processing unlabelled data, or
109*f4a2713aSLionel Sambucwhere the two data items have the same label.  ``__dfsan_union`` is
110*f4a2713aSLionel Sambuca runtime library function which performs all other union computation.
111*f4a2713aSLionel Sambuc
112*f4a2713aSLionel SambucFurther optimizations are possible, for example if ``l1`` is known
113*f4a2713aSLionel Sambucat compile time to be zero (e.g. it is derived from a constant),
114*f4a2713aSLionel Sambuc``l2`` can be used for ``lu``, and vice versa.
115*f4a2713aSLionel Sambuc
116*f4a2713aSLionel SambucMemory layout and label management
117*f4a2713aSLionel Sambuc----------------------------------
118*f4a2713aSLionel Sambuc
119*f4a2713aSLionel SambucThe following is the current memory layout for Linux/x86\_64:
120*f4a2713aSLionel Sambuc
121*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
122*f4a2713aSLionel Sambuc|    Start      |    End        |        Use         |
123*f4a2713aSLionel Sambuc+===============+===============+====================+
124*f4a2713aSLionel Sambuc| 0x700000008000|0x800000000000 | application memory |
125*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
126*f4a2713aSLionel Sambuc| 0x200200000000|0x700000008000 |       unused       |
127*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
128*f4a2713aSLionel Sambuc| 0x200000000000|0x200200000000 |    union table     |
129*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
130*f4a2713aSLionel Sambuc| 0x000000010000|0x200000000000 |   shadow memory    |
131*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
132*f4a2713aSLionel Sambuc| 0x000000000000|0x000000010000 | reserved by kernel |
133*f4a2713aSLionel Sambuc+---------------+---------------+--------------------+
134*f4a2713aSLionel Sambuc
135*f4a2713aSLionel SambucEach byte of application memory corresponds to two bytes of shadow
136*f4a2713aSLionel Sambucmemory, which are used to store its taint label. As for LLVM SSA
137*f4a2713aSLionel Sambucregisters, we have not found it necessary to associate a label with
138*f4a2713aSLionel Sambuceach byte or bit of data, as some other tools do. Instead, labels are
139*f4a2713aSLionel Sambucassociated directly with registers.  Loads will result in a union of
140*f4a2713aSLionel Sambucall shadow labels corresponding to bytes loaded (which most of the
141*f4a2713aSLionel Sambuctime will be short circuited by the initial comparison) and stores will
142*f4a2713aSLionel Sambucresult in a copy of the label to the shadow of all bytes stored to.
143*f4a2713aSLionel Sambuc
144*f4a2713aSLionel SambucPropagating labels through arguments
145*f4a2713aSLionel Sambuc------------------------------------
146*f4a2713aSLionel Sambuc
147*f4a2713aSLionel SambucIn order to propagate labels through function arguments and return values,
148*f4a2713aSLionel SambucDataFlowSanitizer changes the ABI of each function in the translation unit.
149*f4a2713aSLionel SambucThere are currently two supported ABIs:
150*f4a2713aSLionel Sambuc
151*f4a2713aSLionel Sambuc* Args -- Argument and return value labels are passed through additional
152*f4a2713aSLionel Sambuc  arguments and by modifying the return type.
153*f4a2713aSLionel Sambuc
154*f4a2713aSLionel Sambuc* TLS -- Argument and return value labels are passed through TLS variables
155*f4a2713aSLionel Sambuc  ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``.
156*f4a2713aSLionel Sambuc
157*f4a2713aSLionel SambucThe main advantage of the TLS ABI is that it is more tolerant of ABI mismatches
158*f4a2713aSLionel Sambuc(TLS storage is not shared with any other form of storage, whereas extra
159*f4a2713aSLionel Sambucarguments may be stored in registers which under the native ABI are not used
160*f4a2713aSLionel Sambucfor parameter passing and thus could contain arbitrary values).  On the other
161*f4a2713aSLionel Sambuchand the args ABI is more efficient and allows ABI mismatches to be more easily
162*f4a2713aSLionel Sambucidentified by checking for nonzero labels in nominally unlabelled programs.
163*f4a2713aSLionel Sambuc
164*f4a2713aSLionel SambucImplementing the ABI list
165*f4a2713aSLionel Sambuc-------------------------
166*f4a2713aSLionel Sambuc
167*f4a2713aSLionel SambucThe `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions
168*f4a2713aSLionel Sambucwhich conform to the native ABI, each of which is callable from an instrumented
169*f4a2713aSLionel Sambucprogram.  This is implemented by replacing each reference to a native ABI
170*f4a2713aSLionel Sambucfunction with a reference to a function which uses the instrumented ABI.
171*f4a2713aSLionel SambucSuch functions are automatically-generated wrappers for the native functions.
172*f4a2713aSLionel SambucFor example, given the ABI list example provided in the user manual, the
173*f4a2713aSLionel Sambucfollowing wrappers will be generated under the args ABI:
174*f4a2713aSLionel Sambuc
175*f4a2713aSLionel Sambuc.. code-block:: llvm
176*f4a2713aSLionel Sambuc
177*f4a2713aSLionel Sambuc    define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) {
178*f4a2713aSLionel Sambuc    entry:
179*f4a2713aSLionel Sambuc      %2 = call i8* @malloc(i64 %0)
180*f4a2713aSLionel Sambuc      %3 = insertvalue { i8*, i16 } undef, i8* %2, 0
181*f4a2713aSLionel Sambuc      %4 = insertvalue { i8*, i16 } %3, i16 0, 1
182*f4a2713aSLionel Sambuc      ret { i8*, i16 } %4
183*f4a2713aSLionel Sambuc    }
184*f4a2713aSLionel Sambuc
185*f4a2713aSLionel Sambuc    define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) {
186*f4a2713aSLionel Sambuc    entry:
187*f4a2713aSLionel Sambuc      %2 = call i32 @tolower(i32 %0)
188*f4a2713aSLionel Sambuc      %3 = insertvalue { i32, i16 } undef, i32 %2, 0
189*f4a2713aSLionel Sambuc      %4 = insertvalue { i32, i16 } %3, i16 %1, 1
190*f4a2713aSLionel Sambuc      ret { i32, i16 } %4
191*f4a2713aSLionel Sambuc    }
192*f4a2713aSLionel Sambuc
193*f4a2713aSLionel Sambuc    define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) {
194*f4a2713aSLionel Sambuc    entry:
195*f4a2713aSLionel Sambuc      %labelreturn = alloca i16
196*f4a2713aSLionel Sambuc      %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn)
197*f4a2713aSLionel Sambuc      %7 = load i16* %labelreturn
198*f4a2713aSLionel Sambuc      %8 = insertvalue { i8*, i16 } undef, i8* %6, 0
199*f4a2713aSLionel Sambuc      %9 = insertvalue { i8*, i16 } %8, i16 %7, 1
200*f4a2713aSLionel Sambuc      ret { i8*, i16 } %9
201*f4a2713aSLionel Sambuc    }
202*f4a2713aSLionel Sambuc
203*f4a2713aSLionel SambucAs an optimization, direct calls to native ABI functions will call the
204*f4a2713aSLionel Sambucnative ABI function directly and the pass will compute the appropriate label
205*f4a2713aSLionel Sambucinternally.  This has the advantage of reducing the number of union operations
206*f4a2713aSLionel Sambucrequired when the return value label is known to be zero (i.e. ``discard``
207*f4a2713aSLionel Sambucfunctions, or ``functional`` functions with known unlabelled arguments).
208*f4a2713aSLionel Sambuc
209*f4a2713aSLionel SambucChecking ABI Consistency
210*f4a2713aSLionel Sambuc------------------------
211*f4a2713aSLionel Sambuc
212*f4a2713aSLionel SambucDFSan changes the ABI of each function in the module.  This makes it possible
213*f4a2713aSLionel Sambucfor a function with the native ABI to be called with the instrumented ABI,
214*f4a2713aSLionel Sambucor vice versa, thus possibly invoking undefined behavior.  A simple way
215*f4a2713aSLionel Sambucof statically detecting instances of this problem is to prepend the prefix
216*f4a2713aSLionel Sambuc"dfs$" to the name of each instrumented-ABI function.
217*f4a2713aSLionel Sambuc
218*f4a2713aSLionel SambucThis will not catch every such problem; in particular function pointers passed
219*f4a2713aSLionel Sambucacross the instrumented-native barrier cannot be used on the other side.
220*f4a2713aSLionel SambucThese problems could potentially be caught dynamically.
221