xref: /llvm-project/clang/docs/DataFlowAnalysisIntro.md (revision cc8237d9d727b864fe65d2c9d620f3332f4235ca)
1ab31d003SDmitri Gribenko# Data flow analysis: an informal introduction
2ab31d003SDmitri Gribenko
3ab31d003SDmitri Gribenko## Abstract
4ab31d003SDmitri Gribenko
5ab31d003SDmitri GribenkoThis document introduces data flow analysis in an informal way. The goal is to
6ab31d003SDmitri Gribenkogive the reader an intuitive understanding of how it works, and show how it
7ab31d003SDmitri Gribenkoapplies to a range of refactoring and bug finding problems.
8ab31d003SDmitri Gribenko
9ab31d003SDmitri GribenkoData flow analysis is a well-established technique; it is described in many
10ab31d003SDmitri Gribenkopapers, books, and videos. If you would like a more formal, or a more thorough
11ab31d003SDmitri Gribenkoexplanation of the concepts mentioned in this document, please refer to the
12ab31d003SDmitri Gribenkofollowing resources:
13ab31d003SDmitri Gribenko
14ab31d003SDmitri Gribenko*   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
15ab31d003SDmitri Gribenko*   Videos on the PacketPrep YouTube channel that introduce lattices and the
16ab31d003SDmitri Gribenko    necessary background information:
17ab31d003SDmitri Gribenko    [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
18ab31d003SDmitri Gribenko    [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
19ab31d003SDmitri Gribenko    [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
20ab31d003SDmitri Gribenko    [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
21ab31d003SDmitri Gribenko    [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
22ab31d003SDmitri Gribenko    [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
23ab31d003SDmitri Gribenko*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
24ab31d003SDmitri Gribenko*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
25ab31d003SDmitri Gribenko*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
26ab31d003SDmitri Gribenko*   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
27ab31d003SDmitri Gribenko*   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
28ab31d003SDmitri Gribenko    (a paper that successfully applies symbolic execution to real-world
29ab31d003SDmitri Gribenko    software).
30ab31d003SDmitri Gribenko
31ab31d003SDmitri Gribenko## Data flow analysis
32ab31d003SDmitri Gribenko
33ab31d003SDmitri Gribenko### The purpose of data flow analysis
34ab31d003SDmitri Gribenko
35ab31d003SDmitri GribenkoData flow analysis is a static analysis technique that proves facts about a
36ab31d003SDmitri Gribenkoprogram or its fragment. It can make conclusions about all paths through the
37ab31d003SDmitri Gribenkoprogram, while taking control flow into account and scaling to large programs.
38ab31d003SDmitri GribenkoThe basic idea is propagating facts about the program through the edges of the
39ab31d003SDmitri Gribenkocontrol flow graph (CFG) until a fixpoint is reached.
40ab31d003SDmitri Gribenko
41ab31d003SDmitri Gribenko### Sample problem and an ad-hoc solution
42ab31d003SDmitri Gribenko
43ab31d003SDmitri GribenkoWe would like to explain data flow analysis while discussing an example. Let's
44ab31d003SDmitri Gribenkoimagine that we want to track possible values of an integer variable in our
45ab31d003SDmitri Gribenkoprogram. Here is how a human could annotate the code:
46ab31d003SDmitri Gribenko
47ab31d003SDmitri Gribenko```c++
48ab31d003SDmitri Gribenkovoid Example(int n) {
49ab31d003SDmitri Gribenko  int x = 0;
50ab31d003SDmitri Gribenko  // x is {0}
51ab31d003SDmitri Gribenko  if (n > 0) {
52ab31d003SDmitri Gribenko    x = 5;
53ab31d003SDmitri Gribenko    // x is {5}
54ab31d003SDmitri Gribenko  } else {
55ab31d003SDmitri Gribenko    x = 42;
56ab31d003SDmitri Gribenko    // x is {42}
57ab31d003SDmitri Gribenko  }
58ab31d003SDmitri Gribenko  // x is {5; 42}
59ab31d003SDmitri Gribenko  print(x);
60ab31d003SDmitri Gribenko}
61ab31d003SDmitri Gribenko```
62ab31d003SDmitri Gribenko
63ab31d003SDmitri GribenkoWe use sets of integers to represent possible values of `x`. Local variables
64ab31d003SDmitri Gribenkohave unambiguous values between statements, so we annotate program points
65ab31d003SDmitri Gribenkobetween statements with sets of possible values.
66ab31d003SDmitri Gribenko
67ab31d003SDmitri GribenkoHere is how we arrived at these annotations. Assigning a constant to `x` allows
68ab31d003SDmitri Gribenkous to make a conclusion that `x` can only have one value. When control flow from
69ab31d003SDmitri Gribenkothe "then" and "else" branches joins, `x` can have either value.
70ab31d003SDmitri Gribenko
71ab31d003SDmitri GribenkoAbstract algebra provides a nice formalism that models this kind of structure,
72ab31d003SDmitri Gribenkonamely, a lattice. A join-semilattice is a partially ordered set, in which every
73ab31d003SDmitri Gribenkotwo elements have a least upper bound (called a *join*).
74ab31d003SDmitri Gribenko
75ab31d003SDmitri Gribenko```
76ab31d003SDmitri Gribenkojoin(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x
77ab31d003SDmitri Gribenko```
78ab31d003SDmitri Gribenko
79ab31d003SDmitri GribenkoFor this problem we will use the lattice of subsets of integers, with set
80ab31d003SDmitri Gribenkoinclusion relation as ordering and set union as a join.
81ab31d003SDmitri Gribenko
82ab31d003SDmitri GribenkoLattices are often represented visually as Hasse diagrams. Here is a Hasse
83ab31d003SDmitri Gribenkodiagram for our lattice that tracks subsets of integers:
84ab31d003SDmitri Gribenko
85ab31d003SDmitri Gribenko![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)
86ab31d003SDmitri Gribenko
87ab31d003SDmitri GribenkoComputing the join in the lattice corresponds to finding the lowest common
88ab31d003SDmitri Gribenkoancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
89ab31d003SDmitri Gribenkoliterature on efficiently implementing LCA queries for a DAG, however Efficient
90ab31d003SDmitri GribenkoImplementation of Lattice Operations (1989)
91ab31d003SDmitri Gribenko([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
92ab31d003SDmitri Gribenko[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that
93ab31d003SDmitri Gribenkoparticularly well-suited for programmatic implementation.
94ab31d003SDmitri Gribenko
95ab31d003SDmitri Gribenko### Too much information and "top" values
96ab31d003SDmitri Gribenko
97ab31d003SDmitri GribenkoLet's try to find the possible sets of values of `x` in a function that modifies
98ab31d003SDmitri Gribenko`x` in a loop:
99ab31d003SDmitri Gribenko
100ab31d003SDmitri Gribenko```c++
101ab31d003SDmitri Gribenkovoid ExampleOfInfiniteSets() {
102ab31d003SDmitri Gribenko  int x = 0; // x is {0}
103ab31d003SDmitri Gribenko  while (condition()) {
104ab31d003SDmitri Gribenko    x += 1;  // x is {0; 1; 2; …}
105ab31d003SDmitri Gribenko  }
106ab31d003SDmitri Gribenko  print(x);  // x is {0; 1; 2; …}
107ab31d003SDmitri Gribenko}
108ab31d003SDmitri Gribenko```
109ab31d003SDmitri Gribenko
110ab31d003SDmitri GribenkoWe have an issue: `x` can have any value greater than zero; that's an infinite
111ab31d003SDmitri Gribenkoset of values, if the program operated on mathematical integers. In C++ `int` is
112ab31d003SDmitri Gribenkolimited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is
113ab31d003SDmitri Gribenkostill really big.
114ab31d003SDmitri Gribenko
115ab31d003SDmitri GribenkoTo make our analysis practical to compute, we have to limit the amount of
116ab31d003SDmitri Gribenkoinformation that we track. In this case, we can, for example, arbitrarily limit
117ab31d003SDmitri Gribenkothe size of sets to 3 elements. If at a certain program point `x` has more than
118ab31d003SDmitri Gribenko3 possible values, we stop tracking specific values at that program point.
119ab31d003SDmitri GribenkoInstead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
120ab31d003SDmitri Gribenkoaccording to a convention in abstract algebra).
121ab31d003SDmitri Gribenko
122ab31d003SDmitri Gribenko```c++
123ab31d003SDmitri Gribenkovoid ExampleOfTopWithALoop() {
124ab31d003SDmitri Gribenko  int x = 0;  // x is {0}
125ab31d003SDmitri Gribenko  while (condition()) {
126ab31d003SDmitri Gribenko    x += 1;   // x is ⊤
127ab31d003SDmitri Gribenko  }
128ab31d003SDmitri Gribenko  print(x);   // x is ⊤
129ab31d003SDmitri Gribenko}
130ab31d003SDmitri Gribenko```
131ab31d003SDmitri Gribenko
132ab31d003SDmitri GribenkoThe statement "at this program point, `x`'s possible values are `⊤`" is
133ab31d003SDmitri Gribenkounderstood as "at this program point `x` can have any value because we have too
134ab31d003SDmitri Gribenkomuch information, or the information is conflicting".
135ab31d003SDmitri Gribenko
136ab31d003SDmitri GribenkoNote that we can get more than 3 possible values even without a loop:
137ab31d003SDmitri Gribenko
138ab31d003SDmitri Gribenko```c++
139ab31d003SDmitri Gribenkovoid ExampleOfTopWithoutLoops(int n) {
140ab31d003SDmitri Gribenko  int x = 0;  // x is {0}
141ab31d003SDmitri Gribenko  switch(n) {
142ab31d003SDmitri Gribenko    case 0:  x = 1; break; // x is {1}
143ab31d003SDmitri Gribenko    case 1:  x = 9; break; // x is {9}
144ab31d003SDmitri Gribenko    case 2:  x = 7; break; // x is {7}
145ab31d003SDmitri Gribenko    default: x = 3; break; // x is {3}
146ab31d003SDmitri Gribenko  }
147ab31d003SDmitri Gribenko  // x is ⊤
148ab31d003SDmitri Gribenko}
149ab31d003SDmitri Gribenko```
150ab31d003SDmitri Gribenko
151ab31d003SDmitri Gribenko### Uninitialized variables and "bottom" values
152ab31d003SDmitri Gribenko
153ab31d003SDmitri GribenkoWhen `x` is declared but not initialized, it has no possible values. We
154ab31d003SDmitri Gribenkorepresent this fact symbolically as `⊥` (pronounced "bottom").
155ab31d003SDmitri Gribenko
156ab31d003SDmitri Gribenko```c++
157ab31d003SDmitri Gribenkovoid ExampleOfBottom() {
158ab31d003SDmitri Gribenko  int x;    // x is ⊥
159ab31d003SDmitri Gribenko  x = 42;   // x is {42}
160ab31d003SDmitri Gribenko  print(x);
161ab31d003SDmitri Gribenko}
162ab31d003SDmitri Gribenko```
163ab31d003SDmitri Gribenko
164ab31d003SDmitri GribenkoNote that using values read from uninitialized variables is undefined behaviour
165ab31d003SDmitri Gribenkoin C++. Generally, compilers and static analysis tools can assume undefined
166ab31d003SDmitri Gribenkobehavior does not happen. We must model uninitialized variables only when we are
167ab31d003SDmitri Gribenkoimplementing a checker that specifically is trying to find uninitialized reads.
168ab31d003SDmitri GribenkoIn this example we show how to model uninitialized variables only to demonstrate
169ab31d003SDmitri Gribenkothe concept of "bottom", and how it applies to possible value analysis. We
170ab31d003SDmitri Gribenkodescribe an analysis that finds uninitialized reads in a section below.
171ab31d003SDmitri Gribenko
172ab31d003SDmitri Gribenko### A practical lattice that tracks sets of concrete values
173ab31d003SDmitri Gribenko
174ab31d003SDmitri GribenkoTaking into account all corner cases covered above, we can put together a
175ab31d003SDmitri Gribenkolattice that we can use in practice to track possible values of integer
176ab31d003SDmitri Gribenkovariables. This lattice represents sets of integers with 1, 2, or 3 elements, as
177ab31d003SDmitri Gribenkowell as top and bottom. Here is a Hasse diagram for it:
178ab31d003SDmitri Gribenko
179ab31d003SDmitri Gribenko![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)
180ab31d003SDmitri Gribenko
181ab31d003SDmitri Gribenko### Formalization
182ab31d003SDmitri Gribenko
183ab31d003SDmitri GribenkoLet's consider a slightly more complex example, and think about how we can
184ab31d003SDmitri Gribenkocompute the sets of possible values algorithmically.
185ab31d003SDmitri Gribenko
186ab31d003SDmitri Gribenko```c++
187ab31d003SDmitri Gribenkovoid Example(int n) {
188ab31d003SDmitri Gribenko  int x;          // x is ⊥
189ab31d003SDmitri Gribenko  if (n > 0) {
190ab31d003SDmitri Gribenko    if (n == 42) {
191ab31d003SDmitri Gribenko       x = 44;    // x is {44}
192ab31d003SDmitri Gribenko    } else {
193ab31d003SDmitri Gribenko       x = 5;     // x is {5}
194ab31d003SDmitri Gribenko    }
195ab31d003SDmitri Gribenko    print(x);     // x is {44; 5}
196ab31d003SDmitri Gribenko  } else {
197ab31d003SDmitri Gribenko    x = n;        // x is ⊤
198ab31d003SDmitri Gribenko  }
199ab31d003SDmitri Gribenko  print(x);       // x is ⊤
200ab31d003SDmitri Gribenko}
201ab31d003SDmitri Gribenko```
202ab31d003SDmitri Gribenko
203ab31d003SDmitri GribenkoAs humans, we understand the control flow from the program text. We used our
204ab31d003SDmitri Gribenkounderstanding of control flow to find program points where two flows join.
205ab31d003SDmitri GribenkoFormally, control flow is represented by a CFG (control flow graph):
206ab31d003SDmitri Gribenko
207ab31d003SDmitri Gribenko![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)
208ab31d003SDmitri Gribenko
209ab31d003SDmitri GribenkoWe can compute sets of possible values by propagating them through the CFG of
210ab31d003SDmitri Gribenkothe function:
211ab31d003SDmitri Gribenko
212ab31d003SDmitri Gribenko*   When `x` is declared but not initialized, its possible values are `{}`. The
213ab31d003SDmitri Gribenko    empty set plays the role of `⊥` in this lattice.
214ab31d003SDmitri Gribenko
215ab31d003SDmitri Gribenko*   When `x` is assigned a concrete value, its possible set of values contains
216ab31d003SDmitri Gribenko    just that specific value.
217ab31d003SDmitri Gribenko
218ab31d003SDmitri Gribenko*   When `x` is assigned some unknown value, it can have any value. We represent
219ab31d003SDmitri Gribenko    this fact as `⊤`.
220ab31d003SDmitri Gribenko
221ab31d003SDmitri Gribenko*   When two control flow paths join, we compute the set union of incoming
222*cc8237d9SAyushi Shukla    values (limiting the number of elements to 3, representing larger sets as
223ab31d003SDmitri Gribenko    `⊤`).
224ab31d003SDmitri Gribenko
225ab31d003SDmitri GribenkoThe sets of possible values are influenced by:
226ab31d003SDmitri Gribenko
227ab31d003SDmitri Gribenko*   Statements, for example, assignments.
228ab31d003SDmitri Gribenko
229ab31d003SDmitri Gribenko*   Joins in control flow, for example, ones that appear at the end of "if"
230ab31d003SDmitri Gribenko    statements.
231ab31d003SDmitri Gribenko
232ab31d003SDmitri Gribenko**Effects of statements** are modeled by what is formally known as a transfer
233ab31d003SDmitri Gribenkofunction. A transfer function takes two arguments: the statement, and the state
234ab31d003SDmitri Gribenkoof `x` at the previous program point. It produces the state of `x` at the next
235ab31d003SDmitri Gribenkoprogram point. For example, the transfer function for assignment ignores the
236ab31d003SDmitri Gribenkostate at the previous program point:
237ab31d003SDmitri Gribenko
238ab31d003SDmitri Gribenko```c++
239ab31d003SDmitri Gribenko// GIVEN: x is {42; 44}
240ab31d003SDmitri Gribenkox = 0;
241ab31d003SDmitri Gribenko// CONCLUSION: x is {0}
242ab31d003SDmitri Gribenko```
243ab31d003SDmitri Gribenko
244ab31d003SDmitri GribenkoThe transfer function for `+` performs arithmetic on every set member:
245ab31d003SDmitri Gribenko
246ab31d003SDmitri Gribenko```c++
247ab31d003SDmitri Gribenko// GIVEN: x is {42, 44}
248ab31d003SDmitri Gribenkox = x + 100;
249ab31d003SDmitri Gribenko// CONCLUSION: x is {142, 144}
250ab31d003SDmitri Gribenko```
251ab31d003SDmitri Gribenko
252ab31d003SDmitri Gribenko**Effects of control flow** are modeled by joining the knowledge from all
253ab31d003SDmitri Gribenkopossible previous program points.
254ab31d003SDmitri Gribenko
255ab31d003SDmitri Gribenko```c++
256ab31d003SDmitri Gribenkoif (...) {
257ab31d003SDmitri Gribenko  ...
258ab31d003SDmitri Gribenko  // GIVEN: x is {42}
259ab31d003SDmitri Gribenko} else {
260ab31d003SDmitri Gribenko  ...
261ab31d003SDmitri Gribenko  // GIVEN: x is {44}
262ab31d003SDmitri Gribenko}
263ab31d003SDmitri Gribenko// CONCLUSION: x is {42; 44}
264ab31d003SDmitri Gribenko```
265ab31d003SDmitri Gribenko
266ab31d003SDmitri Gribenko```c++
267ab31d003SDmitri Gribenko// GIVEN: x is {42}
268ab31d003SDmitri Gribenkowhile (...) {
269ab31d003SDmitri Gribenko  ...
270ab31d003SDmitri Gribenko  // GIVEN: x is {44}
271ab31d003SDmitri Gribenko}
272ab31d003SDmitri Gribenko// CONCLUSION: {42; 44}
273ab31d003SDmitri Gribenko```
274ab31d003SDmitri Gribenko
275ab31d003SDmitri GribenkoThe predicate that we marked "given" is usually called a precondition, and the
276ab31d003SDmitri Gribenkoconclusion is called a postcondition.
277ab31d003SDmitri Gribenko
278ab31d003SDmitri GribenkoIn terms of the CFG, we join the information from all predecessor basic blocks.
279ab31d003SDmitri Gribenko
280ab31d003SDmitri Gribenko![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)
281ab31d003SDmitri Gribenko
282ab31d003SDmitri GribenkoPutting it all together, to model the effects of a basic block we compute:
283ab31d003SDmitri Gribenko
284ab31d003SDmitri Gribenko```
285ab31d003SDmitri Gribenkoout = transfer(basic_block, join(in_1, in_2, ..., in_n))
286ab31d003SDmitri Gribenko```
287ab31d003SDmitri Gribenko
288ab31d003SDmitri Gribenko(Note that there are other ways to write this equation that produce higher
289ab31d003SDmitri Gribenkoprecision analysis results. The trick is to keep exploring the execution paths
29090a6e354SphyBracketsseparately and delay joining until later. However, we won't discuss those
291ab31d003SDmitri Gribenkovariations here.)
292ab31d003SDmitri Gribenko
293ab31d003SDmitri GribenkoTo make a conclusion about all paths through the program, we repeat this
294ab31d003SDmitri Gribenkocomputation on all basic blocks until we reach a fixpoint. In other words, we
295ab31d003SDmitri Gribenkokeep propagating information through the CFG until the computed sets of values
296ab31d003SDmitri Gribenkostop changing.
297ab31d003SDmitri Gribenko
298ab31d003SDmitri GribenkoIf the lattice has a finite height and transfer functions are monotonic the
299ab31d003SDmitri Gribenkoalgorithm is guaranteed to terminate.  Each iteration of the algorithm can
300ab31d003SDmitri Gribenkochange computed values only to larger values from the lattice. In the worst
301ab31d003SDmitri Gribenkocase, all computed values become `⊤`, which is not very useful, but at least the
302ab31d003SDmitri Gribenkoanalysis terminates at that point, because it can't change any of the values.
303ab31d003SDmitri Gribenko
304ab31d003SDmitri GribenkoFixpoint iteration can be optimised by only reprocessing basic blocks which had
305ab31d003SDmitri Gribenkoone of their inputs changed on the previous iteration. This is typically
306ab31d003SDmitri Gribenkoimplemented using a worklist queue. With this optimisation the time complexity
307ab31d003SDmitri Gribenkobecomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
308ab31d003SDmitri Gribenko`|L|` is the size of lattice used by the analysis.
309ab31d003SDmitri Gribenko
310ab31d003SDmitri Gribenko## Symbolic execution: a very short informal introduction
311ab31d003SDmitri Gribenko
312ab31d003SDmitri Gribenko### Symbolic values
313ab31d003SDmitri Gribenko
314ab31d003SDmitri GribenkoIn the previous example where we tried to figure out what values a variable can
315ab31d003SDmitri Gribenkohave, the analysis had to be seeded with a concrete value. What if there are no
316ab31d003SDmitri Gribenkoassignments of concrete values in the program? We can still deduce some
317ab31d003SDmitri Gribenkointeresting information by representing unknown input values symbolically, and
318ab31d003SDmitri Gribenkocomputing results as symbolic expressions:
319ab31d003SDmitri Gribenko
320ab31d003SDmitri Gribenko```c++
321ab31d003SDmitri Gribenkovoid PrintAbs(int x) {
322ab31d003SDmitri Gribenko  int result;
323ab31d003SDmitri Gribenko  if (x >= 0) {
324ab31d003SDmitri Gribenko    result = x;   // result is {x}
325ab31d003SDmitri Gribenko  } else {
326ab31d003SDmitri Gribenko    result = -x;  // result is {-x}
327ab31d003SDmitri Gribenko  }
328ab31d003SDmitri Gribenko  print(result);  // result is {x; -x}
329ab31d003SDmitri Gribenko}
330ab31d003SDmitri Gribenko```
331ab31d003SDmitri Gribenko
332ab31d003SDmitri GribenkoWe can't say what specific value gets printed, but we know that it is either `x`
333ab31d003SDmitri Gribenkoor `-x`.
334ab31d003SDmitri Gribenko
335*cc8237d9SAyushi ShuklaDataflow analysis is an instance of abstract interpretation, and does not dictate
336ab31d003SDmitri Gribenkohow exactly the lattice and transfer functions should be designed, beyond the
337ab31d003SDmitri Gribenkonecessary conditions for the analysis to converge. Nevertheless, we can use
338ab31d003SDmitri Gribenkosymbolic execution ideas to guide our design of the lattice and transfer
339ab31d003SDmitri Gribenkofunctions: lattice values can be symbolic expressions, and transfer functions
340ab31d003SDmitri Gribenkocan construct more complex symbolic expressions from symbolic expressions that
341ab31d003SDmitri Gribenkorepresent arguments. See [this StackOverflow
342ab31d003SDmitri Gribenkodiscussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
343ab31d003SDmitri Gribenkofor a further comparison of abstract interpretation and symbolic execution.
344ab31d003SDmitri Gribenko
345ab31d003SDmitri Gribenko### Flow condition
346ab31d003SDmitri Gribenko
347ab31d003SDmitri GribenkoA human can say about the previous example that the function returns `x` when
348ab31d003SDmitri Gribenko`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
349ab31d003SDmitri Gribenkotracking a flow condition. A flow condition is a predicate written in terms of
350ab31d003SDmitri Gribenkothe program state that is true at a specific program point regardless of the
351ab31d003SDmitri Gribenkoexecution path that led to this statement. For example, the flow condition for
352ab31d003SDmitri Gribenkothe program point right before evaluating `result = x` is `x >= 0`.
353ab31d003SDmitri Gribenko
354ab31d003SDmitri GribenkoIf we enhance the lattice to be a set of pairs of values and predicates, the
355ab31d003SDmitri Gribenkodataflow analysis computes the following values:
356ab31d003SDmitri Gribenko
357ab31d003SDmitri Gribenko```c++
358ab31d003SDmitri Gribenkovoid PrintAbs(int x) {
359ab31d003SDmitri Gribenko  int result;
360ab31d003SDmitri Gribenko  if (x >= 0) {
361ab31d003SDmitri Gribenko    // Flow condition: x >= 0.
362ab31d003SDmitri Gribenko    result = x;   // result is {x if x >= 0}
363ab31d003SDmitri Gribenko  } else {
364ab31d003SDmitri Gribenko    // Flow condition: x < 0.
365ab31d003SDmitri Gribenko    result = -x;  // result is {-x if x < 0}
366ab31d003SDmitri Gribenko  }
367ab31d003SDmitri Gribenko  print(result);  // result is {x if x >= 0; -x if x < 0}
368ab31d003SDmitri Gribenko}
369ab31d003SDmitri Gribenko```
370ab31d003SDmitri Gribenko
371ab31d003SDmitri GribenkoOf course, in a program with loops, symbolic expressions for flow conditions can
372ab31d003SDmitri Gribenkogrow unbounded. A practical static analysis system must control this growth to
373ab31d003SDmitri Gribenkokeep the symbolic representations manageable and ensure that the data flow
374ab31d003SDmitri Gribenkoanalysis terminates. For example, it can use a constraint solver to prune
375ab31d003SDmitri Gribenkoimpossible flow conditions, and/or it can abstract them, losing precision, after
376ab31d003SDmitri Gribenkotheir symbolic representations grow beyond some threshold. This is similar to
377ab31d003SDmitri Gribenkohow we had to limit the sizes of computed sets of possible values to 3 elements.
378ab31d003SDmitri Gribenko
379ab31d003SDmitri Gribenko### Symbolic pointers
380ab31d003SDmitri Gribenko
381ab31d003SDmitri GribenkoThis approach proves to be particularly useful for modeling pointer values,
382ab31d003SDmitri Gribenkosince we don't care about specific addresses but just want to give a unique
383ab31d003SDmitri Gribenkoidentifier to a memory location.
384ab31d003SDmitri Gribenko
385ab31d003SDmitri Gribenko```c++
386ab31d003SDmitri Gribenkovoid ExampleOfSymbolicPointers(bool b) {
387ab31d003SDmitri Gribenko  int x = 0;     // x is {0}
388ab31d003SDmitri Gribenko  int* ptr = &x; // x is {0}      ptr is {&x}
389ab31d003SDmitri Gribenko  if (b) {
390ab31d003SDmitri Gribenko    *ptr = 42;   // x is {42}     ptr is {&x}
391ab31d003SDmitri Gribenko  }
392ab31d003SDmitri Gribenko  print(x);      // x is {0; 42}  ptr is {&x}
393ab31d003SDmitri Gribenko}
394ab31d003SDmitri Gribenko```
395ab31d003SDmitri Gribenko
396ab31d003SDmitri Gribenko## Example: finding output parameters
397ab31d003SDmitri Gribenko
398ab31d003SDmitri GribenkoLet's explore how data flow analysis can help with a problem that is hard to
399ab31d003SDmitri Gribenkosolve with other tools in Clang.
400ab31d003SDmitri Gribenko
401ab31d003SDmitri Gribenko### Problem description
402ab31d003SDmitri Gribenko
403ab31d003SDmitri GribenkoOutput parameters are function parameters of pointer or reference type whose
404ab31d003SDmitri Gribenkopointee is completely overwritten by the function, and not read before it is
405ab31d003SDmitri Gribenkooverwritten. They are common in pre-C++11 code due to the absence of move
406ab31d003SDmitri Gribenkosemantics. In modern C++ output parameters are non-idiomatic, and return values
407ab31d003SDmitri Gribenkoare used instead.
408ab31d003SDmitri Gribenko
409ab31d003SDmitri GribenkoImagine that we would like to refactor output parameters to return values to
410ab31d003SDmitri Gribenkomodernize old code. The first step is to identify refactoring candidates through
411ab31d003SDmitri Gribenkostatic analysis.
412ab31d003SDmitri Gribenko
413ab31d003SDmitri GribenkoFor example, in the following code snippet the pointer `c` is an output
414ab31d003SDmitri Gribenkoparameter:
415ab31d003SDmitri Gribenko
416ab31d003SDmitri Gribenko```c++
417ab31d003SDmitri Gribenkostruct Customer {
418ab31d003SDmitri Gribenko  int account_id;
419ab31d003SDmitri Gribenko  std::string name;
420ab31d003SDmitri Gribenko}
421ab31d003SDmitri Gribenko
422ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) {
423ab31d003SDmitri Gribenko  c->account_id = ...;
424ab31d003SDmitri Gribenko  if (...) {
425ab31d003SDmitri Gribenko    c->name = ...;
426ab31d003SDmitri Gribenko  } else {
427ab31d003SDmitri Gribenko    c->name = ...;
428ab31d003SDmitri Gribenko  }
429ab31d003SDmitri Gribenko}
430ab31d003SDmitri Gribenko```
431ab31d003SDmitri Gribenko
432ab31d003SDmitri GribenkoWe would like to refactor this code into:
433ab31d003SDmitri Gribenko
434ab31d003SDmitri Gribenko```c++
435ab31d003SDmitri GribenkoCustomer GetCustomer() {
436ab31d003SDmitri Gribenko  Customer c;
437ab31d003SDmitri Gribenko  c.account_id = ...;
438ab31d003SDmitri Gribenko  if (...) {
439ab31d003SDmitri Gribenko    c.name = ...;
440ab31d003SDmitri Gribenko  } else {
441ab31d003SDmitri Gribenko    c.name = ...;
442ab31d003SDmitri Gribenko  }
443ab31d003SDmitri Gribenko  return c;
444ab31d003SDmitri Gribenko}
445ab31d003SDmitri Gribenko```
446ab31d003SDmitri Gribenko
447ab31d003SDmitri GribenkoHowever, in the function below the parameter `c` is not an output parameter
448ab31d003SDmitri Gribenkobecause its field `name` is not overwritten on every path through the function.
449ab31d003SDmitri Gribenko
450ab31d003SDmitri Gribenko```c++
451ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) {
452ab31d003SDmitri Gribenko  c->account_id = ...;
453ab31d003SDmitri Gribenko  if (...) {
454ab31d003SDmitri Gribenko    c->name = ...;
455ab31d003SDmitri Gribenko  }
456ab31d003SDmitri Gribenko}
457ab31d003SDmitri Gribenko```
458ab31d003SDmitri Gribenko
459ab31d003SDmitri GribenkoThe code also cannot read the value of the parameter before overwriting it:
460ab31d003SDmitri Gribenko
461ab31d003SDmitri Gribenko```c++
462ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) {
463ab31d003SDmitri Gribenko  use(c->account_id);
464ab31d003SDmitri Gribenko  c->name = ...;
465ab31d003SDmitri Gribenko  c->account_id = ...;
466ab31d003SDmitri Gribenko}
467ab31d003SDmitri Gribenko```
468ab31d003SDmitri Gribenko
469ab31d003SDmitri GribenkoFunctions that escape the pointer also block the refactoring:
470ab31d003SDmitri Gribenko
471ab31d003SDmitri Gribenko```c++
472ab31d003SDmitri GribenkoCustomer* kGlobalCustomer;
473ab31d003SDmitri Gribenko
474ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) {
475ab31d003SDmitri Gribenko  c->name = ...;
476ab31d003SDmitri Gribenko  c->account_id = ...;
477ab31d003SDmitri Gribenko  kGlobalCustomer = c;
478ab31d003SDmitri Gribenko}
479ab31d003SDmitri Gribenko```
480ab31d003SDmitri Gribenko
481ab31d003SDmitri GribenkoTo identify a candidate function for refactoring, we need to do the following:
482ab31d003SDmitri Gribenko
483ab31d003SDmitri Gribenko*   Find a function with a non-const pointer or reference parameter.
484ab31d003SDmitri Gribenko
485ab31d003SDmitri Gribenko*   Find the definition of that function.
486ab31d003SDmitri Gribenko
487ab31d003SDmitri Gribenko*   Prove that the function completely overwrites the pointee on all paths
488ab31d003SDmitri Gribenko    before returning.
489ab31d003SDmitri Gribenko
490ab31d003SDmitri Gribenko*   Prove that the function reads the pointee only after overwriting it.
491ab31d003SDmitri Gribenko
492ab31d003SDmitri Gribenko*   Prove that the function does not persist the pointer in a data structure
493ab31d003SDmitri Gribenko    that is live after the function returns.
494ab31d003SDmitri Gribenko
495ab31d003SDmitri GribenkoThere are also requirements that all usage sites of the candidate function must
496ab31d003SDmitri Gribenkosatisfy, for example, that function arguments do not alias, that users are not
497ab31d003SDmitri Gribenkotaking the address of the function, and so on. Let's consider verifying usage
498ab31d003SDmitri Gribenkosite conditions to be a separate static analysis problem.
499ab31d003SDmitri Gribenko
500ab31d003SDmitri Gribenko### Lattice design
501ab31d003SDmitri Gribenko
502ab31d003SDmitri GribenkoTo analyze the function body we can use a lattice which consists of normal
503ab31d003SDmitri Gribenkostates and failure states. A normal state describes program points where we are
504ab31d003SDmitri Gribenkosure that no behaviors that block the refactoring have occurred. Normal states
505ab31d003SDmitri Gribenkokeep track of all parameter's member fields that are known to be overwritten on
506ab31d003SDmitri Gribenkoevery path from function entry to the corresponding program point. Failure
507ab31d003SDmitri Gribenkostates accumulate observed violations (unsafe reads and pointer escapes) that
508ab31d003SDmitri Gribenkoblock the refactoring.
509ab31d003SDmitri Gribenko
510ab31d003SDmitri GribenkoIn the partial order of the lattice failure states compare greater than normal
511ab31d003SDmitri Gribenkostates, which guarantees that they "win" when joined with normal states. Order
512ab31d003SDmitri Gribenkobetween failure states is determined by inclusion relation on the set of
513ab31d003SDmitri Gribenkoaccumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order
514ab31d003SDmitri Gribenkobetween normal states is determined by reversed inclusion relation on the set of
515ab31d003SDmitri Gribenkooverwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of
516ab31d003SDmitri Gribenkooverwritten fields).
517ab31d003SDmitri Gribenko
518ab31d003SDmitri Gribenko![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)
519ab31d003SDmitri Gribenko
520ab31d003SDmitri GribenkoTo determine whether a statement reads or writes a field we can implement
521ab31d003SDmitri Gribenkosymbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer
522ab31d003SDmitri Gribenkodereference operator and `MemberExpr`s.
523ab31d003SDmitri Gribenko
524ab31d003SDmitri Gribenko### Using data flow results to identify output parameters
525ab31d003SDmitri Gribenko
526ab31d003SDmitri GribenkoLet's take a look at how we use data flow analysis to identify an output
527ab31d003SDmitri Gribenkoparameter. The refactoring can be safely done when the data flow algorithm
528ab31d003SDmitri Gribenkocomputes a normal state with all of the fields proven to be overwritten in the
529ab31d003SDmitri Gribenkoexit basic block of the function.
530ab31d003SDmitri Gribenko
531ab31d003SDmitri Gribenko```c++
532ab31d003SDmitri Gribenkostruct Customer {
533ab31d003SDmitri Gribenko  int account_id;
534ab31d003SDmitri Gribenko  std::string name;
535ab31d003SDmitri Gribenko};
536ab31d003SDmitri Gribenko
537ab31d003SDmitri Gribenkovoid GetCustomer(Customer* c) {
538ab31d003SDmitri Gribenko  // Overwritten: {}
539ab31d003SDmitri Gribenko  c->account_id = ...; // Overwritten: {c->account_id}
540ab31d003SDmitri Gribenko  if (...) {
541ab31d003SDmitri Gribenko    c->name = ...;     // Overwritten: {c->account_id, c->name}
542ab31d003SDmitri Gribenko  } else {
543ab31d003SDmitri Gribenko    c->name = ...;     // Overwritten: {c->account_id, c->name}
544ab31d003SDmitri Gribenko  }
545ab31d003SDmitri Gribenko  // Overwritten: {c->account_id, c->name}
546ab31d003SDmitri Gribenko}
547ab31d003SDmitri Gribenko```
548ab31d003SDmitri Gribenko
549ab31d003SDmitri GribenkoWhen the data flow algorithm computes a normal state, but not all fields are
550ab31d003SDmitri Gribenkoproven to be overwritten we can't perform the refactoring.
551ab31d003SDmitri Gribenko
552ab31d003SDmitri Gribenko```c++
553ab31d003SDmitri Gribenkovoid target(bool b, Customer* c) {
554ab31d003SDmitri Gribenko  // Overwritten: {}
555ab31d003SDmitri Gribenko  if (b) {
556ab31d003SDmitri Gribenko    c->account_id = 42;     // Overwritten: {c->account_id}
557ab31d003SDmitri Gribenko  } else {
558ab31d003SDmitri Gribenko    c->name = "Konrad";  // Overwritten: {c->name}
559ab31d003SDmitri Gribenko  }
560ab31d003SDmitri Gribenko  // Overwritten: {}
561ab31d003SDmitri Gribenko}
562ab31d003SDmitri Gribenko```
563ab31d003SDmitri Gribenko
564ab31d003SDmitri GribenkoSimilarly, when the data flow algorithm computes a failure state, we also can't
565ab31d003SDmitri Gribenkoperform the refactoring.
566ab31d003SDmitri Gribenko
567ab31d003SDmitri Gribenko```c++
568ab31d003SDmitri GribenkoCustomer* kGlobalCustomer;
569ab31d003SDmitri Gribenko
570ab31d003SDmitri Gribenkovoid GetCustomer(Customer* c) {
571ab31d003SDmitri Gribenko  // Overwritten: {}
572ab31d003SDmitri Gribenko  c->account_id = ...;    // Overwritten: {c->account_id}
573ab31d003SDmitri Gribenko  if (...) {
574ab31d003SDmitri Gribenko    print(c->name);       // Unsafe read
575ab31d003SDmitri Gribenko  } else {
576ab31d003SDmitri Gribenko    kGlobalCustomer = c;  // Pointer escape
577ab31d003SDmitri Gribenko  }
578ab31d003SDmitri Gribenko  // Unsafe read, Pointer escape
579ab31d003SDmitri Gribenko}
580ab31d003SDmitri Gribenko```
581ab31d003SDmitri Gribenko
582ab31d003SDmitri Gribenko## Example: finding dead stores
583ab31d003SDmitri Gribenko
584ab31d003SDmitri GribenkoLet's say we want to find redundant stores, because they indicate potential
585ab31d003SDmitri Gribenkobugs.
586ab31d003SDmitri Gribenko
587ab31d003SDmitri Gribenko```c++
588ab31d003SDmitri Gribenkox = GetX();
589ab31d003SDmitri Gribenkox = GetY();
590ab31d003SDmitri Gribenko```
591ab31d003SDmitri Gribenko
592ab31d003SDmitri GribenkoThe first store to `x` is never read, probably there is a bug.
593ab31d003SDmitri Gribenko
594ab31d003SDmitri GribenkoThe implementation of dead store analysis is very similar to output parameter
595ab31d003SDmitri Gribenkoanalysis: we need to track stores and loads, and find stores that were never
596ab31d003SDmitri Gribenkoread.
597ab31d003SDmitri Gribenko
598ab31d003SDmitri Gribenko[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
599ab31d003SDmitri Gribenkogeneralization of this idea, which is often used to answer many related
600ab31d003SDmitri Gribenkoquestions, for example:
601ab31d003SDmitri Gribenko
602ab31d003SDmitri Gribenko* finding dead stores,
603ab31d003SDmitri Gribenko* finding uninitialized variables,
604ab31d003SDmitri Gribenko* finding a good point to deallocate memory,
605ab31d003SDmitri Gribenko* finding out if it would be safe to move an object.
606ab31d003SDmitri Gribenko
607ab31d003SDmitri Gribenko## Example: definitive initialization
608ab31d003SDmitri Gribenko
609ab31d003SDmitri GribenkoDefinitive initialization proves that variables are known to be initialized when
610ab31d003SDmitri Gribenkoread. If we find a variable which is read when not initialized then we generate
611ab31d003SDmitri Gribenkoa warning.
612ab31d003SDmitri Gribenko
613ab31d003SDmitri Gribenko```c++
614ab31d003SDmitri Gribenkovoid Init() {
615ab31d003SDmitri Gribenko  int x;    // x is uninitialized
616ab31d003SDmitri Gribenko  if (cond()) {
617ab31d003SDmitri Gribenko    x = 10; // x is initialized
618ab31d003SDmitri Gribenko  } else {
619ab31d003SDmitri Gribenko    x = 20; // x is initialized
620ab31d003SDmitri Gribenko  }
621ab31d003SDmitri Gribenko  print(x); // x is initialized
622ab31d003SDmitri Gribenko}
623ab31d003SDmitri Gribenko```
624ab31d003SDmitri Gribenko
625ab31d003SDmitri Gribenko```c++
626ab31d003SDmitri Gribenkovoid Uninit() {
627ab31d003SDmitri Gribenko  int x;    // x is uninitialized
628ab31d003SDmitri Gribenko  if (cond()) {
629ab31d003SDmitri Gribenko    x = 10; // x is initialized
630ab31d003SDmitri Gribenko  }
631ab31d003SDmitri Gribenko  print(x); // x is maybe uninitialized, x is being read, report a bug.
632ab31d003SDmitri Gribenko}
633ab31d003SDmitri Gribenko```
634ab31d003SDmitri Gribenko
635ab31d003SDmitri GribenkoFor this purpose we can use lattice in a form of a mapping from variable
636ab31d003SDmitri Gribenkodeclarations to initialization states; each initialization state is represented
637*cc8237d9SAyushi Shuklaby the following lattice:
638ab31d003SDmitri Gribenko
639ab31d003SDmitri Gribenko![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)
640ab31d003SDmitri Gribenko
641ab31d003SDmitri GribenkoA lattice element could also capture the source locations of the branches that
642ab31d003SDmitri Gribenkolead us to the corresponding program point. Diagnostics would use this
643ab31d003SDmitri Gribenkoinformation to show a sample buggy code path to the user.
644ab31d003SDmitri Gribenko
645ab31d003SDmitri Gribenko## Example: refactoring raw pointers to `unique_ptr`
646ab31d003SDmitri Gribenko
647ab31d003SDmitri GribenkoModern idiomatic C++ uses smart pointers to express memory ownership, however in
648ab31d003SDmitri Gribenkopre-C++11 code one can often find raw pointers that own heap memory blocks.
649ab31d003SDmitri Gribenko
650ab31d003SDmitri GribenkoImagine that we would like to refactor raw pointers that own memory to
651ab31d003SDmitri Gribenko`unique_ptr`. There are multiple ways to design a data flow analysis for this
652ab31d003SDmitri Gribenkoproblem; let's look at one way to do it.
653ab31d003SDmitri Gribenko
654ab31d003SDmitri GribenkoFor example, we would like to refactor the following code that uses raw
655ab31d003SDmitri Gribenkopointers:
656ab31d003SDmitri Gribenko
657ab31d003SDmitri Gribenko```c++
658ab31d003SDmitri Gribenkovoid UniqueOwnership1() {
659ab31d003SDmitri Gribenko  int *pi = new int;
660ab31d003SDmitri Gribenko  if (...) {
661ab31d003SDmitri Gribenko    Borrow(pi);
662ab31d003SDmitri Gribenko    delete pi;
663ab31d003SDmitri Gribenko  } else {
664ab31d003SDmitri Gribenko    TakeOwnership(pi);
665ab31d003SDmitri Gribenko  }
666ab31d003SDmitri Gribenko}
667ab31d003SDmitri Gribenko```
668ab31d003SDmitri Gribenko
669ab31d003SDmitri Gribenkointo code that uses `unique_ptr`:
670ab31d003SDmitri Gribenko
671ab31d003SDmitri Gribenko```c++
672ab31d003SDmitri Gribenkovoid UniqueOwnership1() {
673ab31d003SDmitri Gribenko  auto pi = std::make_unique<int>();
674ab31d003SDmitri Gribenko  if (...) {
675ab31d003SDmitri Gribenko    Borrow(pi.get());
676ab31d003SDmitri Gribenko  } else {
677ab31d003SDmitri Gribenko    TakeOwnership(pi.release());
678ab31d003SDmitri Gribenko  }
679ab31d003SDmitri Gribenko}
680ab31d003SDmitri Gribenko```
681ab31d003SDmitri Gribenko
682ab31d003SDmitri GribenkoThis problem can be solved with a lattice in form of map from value declarations
683ab31d003SDmitri Gribenkoto pointer states:
684ab31d003SDmitri Gribenko
685ab31d003SDmitri Gribenko![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)
686ab31d003SDmitri Gribenko
687ab31d003SDmitri GribenkoWe can perform the refactoring if at the exit of a function `pi` is
688ab31d003SDmitri Gribenko`Compatible`.
689ab31d003SDmitri Gribenko
690ab31d003SDmitri Gribenko```c++
691ab31d003SDmitri Gribenkovoid UniqueOwnership1() {
692ab31d003SDmitri Gribenko  int *pi;             // pi is Compatible
693ab31d003SDmitri Gribenko  pi = new int;        // pi is Defined
694ab31d003SDmitri Gribenko  if (...) {
695ab31d003SDmitri Gribenko    Borrow(pi);        // pi is Defined
696ab31d003SDmitri Gribenko    delete pi;         // pi is Compatible
697ab31d003SDmitri Gribenko  } else {
698ab31d003SDmitri Gribenko    TakeOwnership(pi); // pi is Compatible
699ab31d003SDmitri Gribenko  }
700ab31d003SDmitri Gribenko  // pi is Compatible
701ab31d003SDmitri Gribenko}
702ab31d003SDmitri Gribenko```
703ab31d003SDmitri Gribenko
704ab31d003SDmitri GribenkoLet's look at an example where the raw pointer owns two different memory blocks:
705ab31d003SDmitri Gribenko
706ab31d003SDmitri Gribenko```c++
707ab31d003SDmitri Gribenkovoid UniqueOwnership2() {
708ab31d003SDmitri Gribenko  int *pi = new int;  // pi is Defined
709ab31d003SDmitri Gribenko  Borrow(pi);
710ab31d003SDmitri Gribenko  delete pi;          // pi is Compatible
711ab31d003SDmitri Gribenko  if (smth) {
712ab31d003SDmitri Gribenko    pi = new int;     // pi is Defined
713ab31d003SDmitri Gribenko    Borrow(pi);
714ab31d003SDmitri Gribenko    delete pi;        // pi is Compatible
715ab31d003SDmitri Gribenko  }
716ab31d003SDmitri Gribenko  // pi is Compatible
717ab31d003SDmitri Gribenko}
718ab31d003SDmitri Gribenko```
719ab31d003SDmitri Gribenko
720ab31d003SDmitri GribenkoIt can be refactored to use `unique_ptr` like this:
721ab31d003SDmitri Gribenko
722ab31d003SDmitri Gribenko```c++
723ab31d003SDmitri Gribenkovoid UniqueOwnership2() {
724ab31d003SDmitri Gribenko  auto pi = make_unique<int>();
725ab31d003SDmitri Gribenko  Borrow(pi);
726ab31d003SDmitri Gribenko  if (smth) {
727ab31d003SDmitri Gribenko    pi = make_unique<int>();
728ab31d003SDmitri Gribenko    Borrow(pi);
729ab31d003SDmitri Gribenko  }
730ab31d003SDmitri Gribenko}
731ab31d003SDmitri Gribenko```
732ab31d003SDmitri Gribenko
733ab31d003SDmitri GribenkoIn the following example, the raw pointer is used to access the heap object
734ab31d003SDmitri Gribenkoafter the ownership has been transferred.
735ab31d003SDmitri Gribenko
736ab31d003SDmitri Gribenko```c++
737ab31d003SDmitri Gribenkovoid UniqueOwnership3() {
738ab31d003SDmitri Gribenko  int *pi = new int; // pi is Defined
739ab31d003SDmitri Gribenko  if (...) {
740ab31d003SDmitri Gribenko    Borrow(pi);
741ab31d003SDmitri Gribenko    delete pi;       // pi is Compatible
742ab31d003SDmitri Gribenko  } else {
743ab31d003SDmitri Gribenko    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
744ab31d003SDmitri Gribenko    print(*pi);
745ab31d003SDmitri Gribenko    use(v);
746ab31d003SDmitri Gribenko  }
747ab31d003SDmitri Gribenko  // pi is Compatible
748ab31d003SDmitri Gribenko}
749ab31d003SDmitri Gribenko```
750ab31d003SDmitri Gribenko
751ab31d003SDmitri GribenkoWe can refactor this code to use `unique_ptr`, however we would have to
752ab31d003SDmitri Gribenkointroduce a non-owning pointer variable, since we can't use the moved-from
753ab31d003SDmitri Gribenko`unique_ptr` to access the object:
754ab31d003SDmitri Gribenko
755ab31d003SDmitri Gribenko```c++
756ab31d003SDmitri Gribenkovoid UniqueOwnership3() {
757ab31d003SDmitri Gribenko  std::unique_ptr<int> pi = std::make_unique<int>();
758ab31d003SDmitri Gribenko  if (...) {
759ab31d003SDmitri Gribenko    Borrow(pi);
760ab31d003SDmitri Gribenko  } else {
761ab31d003SDmitri Gribenko    int *pi_non_owning = pi.get();
762ab31d003SDmitri Gribenko    vector<unique_ptr<int>> v = {std::move(pi)};
763ab31d003SDmitri Gribenko    print(*pi_non_owning);
764ab31d003SDmitri Gribenko    use(v);
765ab31d003SDmitri Gribenko  }
766ab31d003SDmitri Gribenko}
767ab31d003SDmitri Gribenko```
768ab31d003SDmitri Gribenko
769ab31d003SDmitri GribenkoIf the original code didn't call `delete` at the very end of the function, then
770ab31d003SDmitri Gribenkoour refactoring may change the point at which we run the destructor and release
771ab31d003SDmitri Gribenkomemory. Specifically, if there is some user code after `delete`, then extending
772ab31d003SDmitri Gribenkothe lifetime of the object until the end of the function may hold locks for
773ab31d003SDmitri Gribenkolonger than necessary, introduce memory overhead etc.
774ab31d003SDmitri Gribenko
775ab31d003SDmitri GribenkoOne solution is to always replace `delete` with a call to `reset()`, and then
776ab31d003SDmitri Gribenkoperform another analysis that removes unnecessary `reset()` calls.
777ab31d003SDmitri Gribenko
778ab31d003SDmitri Gribenko```c++
779ab31d003SDmitri Gribenkovoid AddedMemoryOverhead() {
780ab31d003SDmitri Gribenko  HugeObject *ho = new HugeObject();
781ab31d003SDmitri Gribenko  use(ho);
782ab31d003SDmitri Gribenko  delete ho; // Release the large amount of memory quickly.
783ab31d003SDmitri Gribenko  LongRunningFunction();
784ab31d003SDmitri Gribenko}
785ab31d003SDmitri Gribenko```
786ab31d003SDmitri Gribenko
787ab31d003SDmitri GribenkoThis analysis will refuse to refactor code that mixes borrowed pointer values
788ab31d003SDmitri Gribenkoand unique ownership. In the following code, `GetPtr()` returns a borrowed
789ab31d003SDmitri Gribenkopointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
790ab31d003SDmitri Gribenkopointer. We don't distinguish between these two assignments, and we want each
791ab31d003SDmitri Gribenkoassignment to be paired with a corresponding sink; otherwise, we transition the
792ab31d003SDmitri Gribenkopointer to a `Conflicting` state, like in this example.
793ab31d003SDmitri Gribenko
794ab31d003SDmitri Gribenko```c++
795ab31d003SDmitri Gribenkovoid ConflictingOwnership() {
796ab31d003SDmitri Gribenko  int *pi;           // pi is Compatible
797ab31d003SDmitri Gribenko  pi = GetPtr();     // pi is Defined
798ab31d003SDmitri Gribenko  Borrow(pi);        // pi is Defined
799ab31d003SDmitri Gribenko
800ab31d003SDmitri Gribenko  pi = new int;      // pi is Conflicting
801ab31d003SDmitri Gribenko  Borrow(pi);
802ab31d003SDmitri Gribenko  delete pi;
803ab31d003SDmitri Gribenko  // pi is Conflicting
804ab31d003SDmitri Gribenko}
805ab31d003SDmitri Gribenko```
806ab31d003SDmitri Gribenko
807ab31d003SDmitri GribenkoWe could still handle this case by finding a maximal range in the code where
808ab31d003SDmitri Gribenko`pi` could be in the Compatible state, and only refactoring that part.
809ab31d003SDmitri Gribenko
810ab31d003SDmitri Gribenko```c++
811ab31d003SDmitri Gribenkovoid ConflictingOwnership() {
812ab31d003SDmitri Gribenko  int *pi;
813ab31d003SDmitri Gribenko  pi = GetPtr();
814ab31d003SDmitri Gribenko  Borrow(pi);
815ab31d003SDmitri Gribenko
816ab31d003SDmitri Gribenko  std::unique_ptr<int> pi_unique = std::make_unique<int>();
817ab31d003SDmitri Gribenko  Borrow(pi_unique.get());
818ab31d003SDmitri Gribenko}
819ab31d003SDmitri Gribenko```
820ab31d003SDmitri Gribenko
821ab31d003SDmitri Gribenko## Example: finding redundant branch conditions
822ab31d003SDmitri Gribenko
823ab31d003SDmitri GribenkoIn the code below `b1` should not be checked in both the outer and inner "if"
824ab31d003SDmitri Gribenkostatements. It is likely there is a bug in this code.
825ab31d003SDmitri Gribenko
826ab31d003SDmitri Gribenko```c++
827ab31d003SDmitri Gribenkoint F(bool b1, bool b2) {
828ab31d003SDmitri Gribenko  if (b1) {
829ab31d003SDmitri Gribenko    f();
830ab31d003SDmitri Gribenko    if (b1 && b2) {  // Check `b1` again -- unnecessary!
831ab31d003SDmitri Gribenko      g();
832ab31d003SDmitri Gribenko    }
833ab31d003SDmitri Gribenko  }
834ab31d003SDmitri Gribenko}
835ab31d003SDmitri Gribenko```
836ab31d003SDmitri Gribenko
837ab31d003SDmitri GribenkoA checker that finds this pattern syntactically is already implemented in
838ab31d003SDmitri GribenkoClangTidy using AST matchers (`bugprone-redundant-branch-condition`).
839ab31d003SDmitri Gribenko
840ab31d003SDmitri GribenkoTo implement it using the data flow analysis framework, we can produce a warning
841ab31d003SDmitri Gribenkoif any part of the branch condition is implied by the flow condition.
842ab31d003SDmitri Gribenko
843ab31d003SDmitri Gribenko```c++
844ab31d003SDmitri Gribenkoint F(bool b1, bool b2) {
845ab31d003SDmitri Gribenko  // Flow condition: true.
846ab31d003SDmitri Gribenko  if (b1) {
847ab31d003SDmitri Gribenko    // Flow condition: b1.
848ab31d003SDmitri Gribenko    f();
849ab31d003SDmitri Gribenko    if (b1 && b2) { // `b1` is implied by the flow condition.
850ab31d003SDmitri Gribenko      g();
851ab31d003SDmitri Gribenko    }
852ab31d003SDmitri Gribenko  }
853ab31d003SDmitri Gribenko}
854ab31d003SDmitri Gribenko```
855ab31d003SDmitri Gribenko
856ab31d003SDmitri GribenkoOne way to check this implication is to use a SAT solver. Without a SAT solver,
857ab31d003SDmitri Gribenkowe could keep the flow condition in the CNF form and then it would be easy to
858ab31d003SDmitri Gribenkocheck the implication.
859ab31d003SDmitri Gribenko
860ab31d003SDmitri Gribenko## Example: finding unchecked `std::optional` unwraps
861ab31d003SDmitri Gribenko
862ab31d003SDmitri GribenkoCalling `optional::value()` is only valid if `optional::has_value()` is true. We
863ab31d003SDmitri Gribenkowant to show that when `x.value()` is executed, the flow condition implies
864ab31d003SDmitri Gribenko`x.has_value()`.
865ab31d003SDmitri Gribenko
866ab31d003SDmitri GribenkoIn the example below `x.value()` is accessed safely because it is guarded by the
867ab31d003SDmitri Gribenko`x.has_value()` check.
868ab31d003SDmitri Gribenko
869ab31d003SDmitri Gribenko```c++
870ab31d003SDmitri Gribenkovoid Example(std::optional<int> &x) {
871ab31d003SDmitri Gribenko  if (x.has_value()) {
872ab31d003SDmitri Gribenko    use(x.value());
873ab31d003SDmitri Gribenko  }
874ab31d003SDmitri Gribenko}
875ab31d003SDmitri Gribenko```
876ab31d003SDmitri Gribenko
877ab31d003SDmitri GribenkoWhile entering the if branch we deduce that `x.has_value()` is implied by the
878ab31d003SDmitri Gribenkoflow condition.
879ab31d003SDmitri Gribenko
880ab31d003SDmitri Gribenko```c++
881ab31d003SDmitri Gribenkovoid Example(std::optional<int> x) {
882ab31d003SDmitri Gribenko  // Flow condition: true.
883ab31d003SDmitri Gribenko  if (x.has_value()) {
884ab31d003SDmitri Gribenko    // Flow condition: x.has_value() == true.
885ab31d003SDmitri Gribenko    use(x.value());
886ab31d003SDmitri Gribenko  }
887ab31d003SDmitri Gribenko  // Flow condition: true.
888ab31d003SDmitri Gribenko}
889ab31d003SDmitri Gribenko```
890ab31d003SDmitri Gribenko
891ab31d003SDmitri GribenkoWe also need to prove that `x` is not modified between check and value access.
892ab31d003SDmitri GribenkoThe modification of `x` may be very subtle:
893ab31d003SDmitri Gribenko
894ab31d003SDmitri Gribenko```c++
895ab31d003SDmitri Gribenkovoid F(std::optional<int> &x);
896ab31d003SDmitri Gribenko
897ab31d003SDmitri Gribenkovoid Example(std::optional<int> &x) {
898ab31d003SDmitri Gribenko  if (x.has_value()) {
899ab31d003SDmitri Gribenko    // Flow condition: x.has_value() == true.
900ab31d003SDmitri Gribenko    unknown_function(x); // may change x.
901ab31d003SDmitri Gribenko    // Flow condition: true.
902ab31d003SDmitri Gribenko    use(x.value());
903ab31d003SDmitri Gribenko  }
904ab31d003SDmitri Gribenko}
905ab31d003SDmitri Gribenko```
906ab31d003SDmitri Gribenko
907ab31d003SDmitri Gribenko## Example: finding dead code behind A/B experiment flags
908ab31d003SDmitri Gribenko
909ab31d003SDmitri GribenkoFinding dead code is a classic application of data flow analysis.
910ab31d003SDmitri Gribenko
911ab31d003SDmitri GribenkoUnused flags for A/B experiment hide dead code. However, this flavor of dead
912ab31d003SDmitri Gribenkocode is invisible to the compiler because the flag can be turned on at any
913ab31d003SDmitri Gribenkomoment.
914ab31d003SDmitri Gribenko
915ab31d003SDmitri GribenkoWe could make a tool that deletes experiment flags. The user tells us which flag
916ab31d003SDmitri Gribenkothey want to delete, and we assume that the it's value is a given constant.
917ab31d003SDmitri Gribenko
918ab31d003SDmitri GribenkoFor example, the user could use the tool to remove `example_flag` from this
919ab31d003SDmitri Gribenkocode:
920ab31d003SDmitri Gribenko
921ab31d003SDmitri Gribenko```c++
922ab31d003SDmitri GribenkoDEFINE_FLAG(std::string, example_flag, "", "A sample flag.");
923ab31d003SDmitri Gribenko
924ab31d003SDmitri Gribenkovoid Example() {
925ab31d003SDmitri Gribenko  bool x = GetFlag(FLAGS_example_flag).empty();
926ab31d003SDmitri Gribenko  f();
927ab31d003SDmitri Gribenko  if (x) {
928ab31d003SDmitri Gribenko    g();
929ab31d003SDmitri Gribenko  } else {
930ab31d003SDmitri Gribenko    h();
931ab31d003SDmitri Gribenko  }
932ab31d003SDmitri Gribenko}
933ab31d003SDmitri Gribenko```
934ab31d003SDmitri Gribenko
935ab31d003SDmitri GribenkoThe tool would simplify the code to:
936ab31d003SDmitri Gribenko
937ab31d003SDmitri Gribenko```c++
938ab31d003SDmitri Gribenkovoid Example() {
939ab31d003SDmitri Gribenko  f();
940ab31d003SDmitri Gribenko  g();
941ab31d003SDmitri Gribenko}
942ab31d003SDmitri Gribenko```
943ab31d003SDmitri Gribenko
944ab31d003SDmitri GribenkoWe can solve this problem with a classic constant propagation lattice combined
945ab31d003SDmitri Gribenkowith symbolic evaluation.
946ab31d003SDmitri Gribenko
947ab31d003SDmitri Gribenko## Example: finding inefficient usages of associative containers
948ab31d003SDmitri Gribenko
949ab31d003SDmitri GribenkoReal-world code often accidentally performs repeated lookups in associative
950ab31d003SDmitri Gribenkocontainers:
951ab31d003SDmitri Gribenko
952ab31d003SDmitri Gribenko```c++
953ab31d003SDmitri Gribenkomap<int, Employee> xs;
954ab31d003SDmitri Gribenkoxs[42]->name = "...";
955ab31d003SDmitri Gribenkoxs[42]->title = "...";
956ab31d003SDmitri Gribenko```
957ab31d003SDmitri Gribenko
958ab31d003SDmitri GribenkoTo find the above inefficiency we can use the available expressions analysis to
959ab31d003SDmitri Gribenkounderstand that `m[42]` is evaluated twice.
960ab31d003SDmitri Gribenko
961ab31d003SDmitri Gribenko```c++
962ab31d003SDmitri Gribenkomap<int, Employee> xs;
963ab31d003SDmitri GribenkoEmployee &e = xs[42];
964ab31d003SDmitri Gribenkoe->name = "...";
965ab31d003SDmitri Gribenkoe->title = "...";
966ab31d003SDmitri Gribenko```
967ab31d003SDmitri Gribenko
968ab31d003SDmitri GribenkoWe can also track the `m.contains()` check in the flow condition to find
969ab31d003SDmitri Gribenkoredundant checks, like in the example below.
970ab31d003SDmitri Gribenko
971ab31d003SDmitri Gribenko```c++
972ab31d003SDmitri Gribenkostd::map<int, Employee> xs;
973ab31d003SDmitri Gribenkoif (!xs.contains(42)) {
974ab31d003SDmitri Gribenko  xs.insert({42, someEmployee});
975ab31d003SDmitri Gribenko}
976ab31d003SDmitri Gribenko```
977ab31d003SDmitri Gribenko
978ab31d003SDmitri Gribenko## Example: refactoring types that implicitly convert to each other
979ab31d003SDmitri Gribenko
980ab31d003SDmitri GribenkoRefactoring one strong type to another is difficult, but the compiler can help:
981ab31d003SDmitri Gribenkoonce you refactor one reference to the type, the compiler will flag other places
982ab31d003SDmitri Gribenkowhere this information flows with type mismatch errors. Unfortunately this
983ab31d003SDmitri Gribenkostrategy does not work when you are refactoring types that implicitly convert to
984ab31d003SDmitri Gribenkoeach other, for example, replacing `int32_t` with `int64_t`.
985ab31d003SDmitri Gribenko
986ab31d003SDmitri GribenkoImagine that we want to change user IDs from 32 to 64-bit integers. In other
987ab31d003SDmitri Gribenkowords, we need to find all integers tainted with user IDs. We can use data flow
988ab31d003SDmitri Gribenkoanalysis to implement taint analysis.
989ab31d003SDmitri Gribenko
990ab31d003SDmitri Gribenko```c++
991ab31d003SDmitri Gribenkovoid UseUser(int32_t user_id) {
992ab31d003SDmitri Gribenko  int32_t id = user_id;
993ab31d003SDmitri Gribenko  // Variable `id` is tainted with a user ID.
994ab31d003SDmitri Gribenko  ...
995ab31d003SDmitri Gribenko}
996ab31d003SDmitri Gribenko```
997ab31d003SDmitri Gribenko
998ab31d003SDmitri GribenkoTaint analysis is very well suited to this problem because the program rarely
999ab31d003SDmitri Gribenkobranches on user IDs, and almost certainly does not perform any computation
1000ab31d003SDmitri Gribenko(like arithmetic).
1001