xref: /openbsd-src/gnu/llvm/clang/docs/DataFlowAnalysisIntro.md (revision 12c855180aad702bbcca06e0398d774beeafb155)
1*12c85518Srobert# Data flow analysis: an informal introduction
2*12c85518Srobert
3*12c85518Srobert## Abstract
4*12c85518Srobert
5*12c85518SrobertThis document introduces data flow analysis in an informal way. The goal is to
6*12c85518Srobertgive the reader an intuitive understanding of how it works, and show how it
7*12c85518Srobertapplies to a range of refactoring and bug finding problems.
8*12c85518Srobert
9*12c85518SrobertData flow analysis is a well-established technique; it is described in many
10*12c85518Srobertpapers, books, and videos. If you would like a more formal, or a more thorough
11*12c85518Srobertexplanation of the concepts mentioned in this document, please refer to the
12*12c85518Srobertfollowing resources:
13*12c85518Srobert
14*12c85518Srobert*   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
15*12c85518Srobert*   Videos on the PacketPrep YouTube channel that introduce lattices and the
16*12c85518Srobert    necessary background information:
17*12c85518Srobert    [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
18*12c85518Srobert    [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
19*12c85518Srobert    [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
20*12c85518Srobert    [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
21*12c85518Srobert    [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
22*12c85518Srobert    [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
23*12c85518Srobert*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
24*12c85518Srobert*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
25*12c85518Srobert*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
26*12c85518Srobert*   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
27*12c85518Srobert*   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
28*12c85518Srobert    (a paper that successfully applies symbolic execution to real-world
29*12c85518Srobert    software).
30*12c85518Srobert
31*12c85518Srobert## Data flow analysis
32*12c85518Srobert
33*12c85518Srobert### The purpose of data flow analysis
34*12c85518Srobert
35*12c85518SrobertData flow analysis is a static analysis technique that proves facts about a
36*12c85518Srobertprogram or its fragment. It can make conclusions about all paths through the
37*12c85518Srobertprogram, while taking control flow into account and scaling to large programs.
38*12c85518SrobertThe basic idea is propagating facts about the program through the edges of the
39*12c85518Srobertcontrol flow graph (CFG) until a fixpoint is reached.
40*12c85518Srobert
41*12c85518Srobert### Sample problem and an ad-hoc solution
42*12c85518Srobert
43*12c85518SrobertWe would like to explain data flow analysis while discussing an example. Let's
44*12c85518Srobertimagine that we want to track possible values of an integer variable in our
45*12c85518Srobertprogram. Here is how a human could annotate the code:
46*12c85518Srobert
47*12c85518Srobert```c++
48*12c85518Srobertvoid Example(int n) {
49*12c85518Srobert  int x = 0;
50*12c85518Srobert  // x is {0}
51*12c85518Srobert  if (n > 0) {
52*12c85518Srobert    x = 5;
53*12c85518Srobert    // x is {5}
54*12c85518Srobert  } else {
55*12c85518Srobert    x = 42;
56*12c85518Srobert    // x is {42}
57*12c85518Srobert  }
58*12c85518Srobert  // x is {5; 42}
59*12c85518Srobert  print(x);
60*12c85518Srobert}
61*12c85518Srobert```
62*12c85518Srobert
63*12c85518SrobertWe use sets of integers to represent possible values of `x`. Local variables
64*12c85518Sroberthave unambiguous values between statements, so we annotate program points
65*12c85518Srobertbetween statements with sets of possible values.
66*12c85518Srobert
67*12c85518SrobertHere is how we arrived at these annotations. Assigning a constant to `x` allows
68*12c85518Srobertus to make a conclusion that `x` can only have one value. When control flow from
69*12c85518Srobertthe "then" and "else" branches joins, `x` can have either value.
70*12c85518Srobert
71*12c85518SrobertAbstract algebra provides a nice formalism that models this kind of structure,
72*12c85518Srobertnamely, a lattice. A join-semilattice is a partially ordered set, in which every
73*12c85518Sroberttwo elements have a least upper bound (called a *join*).
74*12c85518Srobert
75*12c85518Srobert```
76*12c85518Srobertjoin(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x
77*12c85518Srobert```
78*12c85518Srobert
79*12c85518SrobertFor this problem we will use the lattice of subsets of integers, with set
80*12c85518Srobertinclusion relation as ordering and set union as a join.
81*12c85518Srobert
82*12c85518SrobertLattices are often represented visually as Hasse diagrams. Here is a Hasse
83*12c85518Srobertdiagram for our lattice that tracks subsets of integers:
84*12c85518Srobert
85*12c85518Srobert![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)
86*12c85518Srobert
87*12c85518SrobertComputing the join in the lattice corresponds to finding the lowest common
88*12c85518Srobertancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
89*12c85518Srobertliterature on efficiently implementing LCA queries for a DAG, however Efficient
90*12c85518SrobertImplementation of Lattice Operations (1989)
91*12c85518Srobert([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
92*12c85518Srobert[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that
93*12c85518Srobertparticularly well-suited for programmatic implementation.
94*12c85518Srobert
95*12c85518Srobert### Too much information and "top" values
96*12c85518Srobert
97*12c85518SrobertLet's try to find the possible sets of values of `x` in a function that modifies
98*12c85518Srobert`x` in a loop:
99*12c85518Srobert
100*12c85518Srobert```c++
101*12c85518Srobertvoid ExampleOfInfiniteSets() {
102*12c85518Srobert  int x = 0; // x is {0}
103*12c85518Srobert  while (condition()) {
104*12c85518Srobert    x += 1;  // x is {0; 1; 2; …}
105*12c85518Srobert  }
106*12c85518Srobert  print(x);  // x is {0; 1; 2; …}
107*12c85518Srobert}
108*12c85518Srobert```
109*12c85518Srobert
110*12c85518SrobertWe have an issue: `x` can have any value greater than zero; that's an infinite
111*12c85518Srobertset of values, if the program operated on mathematical integers. In C++ `int` is
112*12c85518Srobertlimited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is
113*12c85518Srobertstill really big.
114*12c85518Srobert
115*12c85518SrobertTo make our analysis practical to compute, we have to limit the amount of
116*12c85518Srobertinformation that we track. In this case, we can, for example, arbitrarily limit
117*12c85518Srobertthe size of sets to 3 elements. If at a certain program point `x` has more than
118*12c85518Srobert3 possible values, we stop tracking specific values at that program point.
119*12c85518SrobertInstead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
120*12c85518Srobertaccording to a convention in abstract algebra).
121*12c85518Srobert
122*12c85518Srobert```c++
123*12c85518Srobertvoid ExampleOfTopWithALoop() {
124*12c85518Srobert  int x = 0;  // x is {0}
125*12c85518Srobert  while (condition()) {
126*12c85518Srobert    x += 1;   // x is ⊤
127*12c85518Srobert  }
128*12c85518Srobert  print(x);   // x is ⊤
129*12c85518Srobert}
130*12c85518Srobert```
131*12c85518Srobert
132*12c85518SrobertThe statement "at this program point, `x`'s possible values are `⊤`" is
133*12c85518Srobertunderstood as "at this program point `x` can have any value because we have too
134*12c85518Srobertmuch information, or the information is conflicting".
135*12c85518Srobert
136*12c85518SrobertNote that we can get more than 3 possible values even without a loop:
137*12c85518Srobert
138*12c85518Srobert```c++
139*12c85518Srobertvoid ExampleOfTopWithoutLoops(int n) {
140*12c85518Srobert  int x = 0;  // x is {0}
141*12c85518Srobert  switch(n) {
142*12c85518Srobert    case 0:  x = 1; break; // x is {1}
143*12c85518Srobert    case 1:  x = 9; break; // x is {9}
144*12c85518Srobert    case 2:  x = 7; break; // x is {7}
145*12c85518Srobert    default: x = 3; break; // x is {3}
146*12c85518Srobert  }
147*12c85518Srobert  // x is ⊤
148*12c85518Srobert}
149*12c85518Srobert```
150*12c85518Srobert
151*12c85518Srobert### Uninitialized variables and "bottom" values
152*12c85518Srobert
153*12c85518SrobertWhen `x` is declared but not initialized, it has no possible values. We
154*12c85518Srobertrepresent this fact symbolically as `⊥` (pronounced "bottom").
155*12c85518Srobert
156*12c85518Srobert```c++
157*12c85518Srobertvoid ExampleOfBottom() {
158*12c85518Srobert  int x;    // x is ⊥
159*12c85518Srobert  x = 42;   // x is {42}
160*12c85518Srobert  print(x);
161*12c85518Srobert}
162*12c85518Srobert```
163*12c85518Srobert
164*12c85518SrobertNote that using values read from uninitialized variables is undefined behaviour
165*12c85518Srobertin C++. Generally, compilers and static analysis tools can assume undefined
166*12c85518Srobertbehavior does not happen. We must model uninitialized variables only when we are
167*12c85518Srobertimplementing a checker that specifically is trying to find uninitialized reads.
168*12c85518SrobertIn this example we show how to model uninitialized variables only to demonstrate
169*12c85518Srobertthe concept of "bottom", and how it applies to possible value analysis. We
170*12c85518Srobertdescribe an analysis that finds uninitialized reads in a section below.
171*12c85518Srobert
172*12c85518Srobert### A practical lattice that tracks sets of concrete values
173*12c85518Srobert
174*12c85518SrobertTaking into account all corner cases covered above, we can put together a
175*12c85518Srobertlattice that we can use in practice to track possible values of integer
176*12c85518Srobertvariables. This lattice represents sets of integers with 1, 2, or 3 elements, as
177*12c85518Srobertwell as top and bottom. Here is a Hasse diagram for it:
178*12c85518Srobert
179*12c85518Srobert![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)
180*12c85518Srobert
181*12c85518Srobert### Formalization
182*12c85518Srobert
183*12c85518SrobertLet's consider a slightly more complex example, and think about how we can
184*12c85518Srobertcompute the sets of possible values algorithmically.
185*12c85518Srobert
186*12c85518Srobert```c++
187*12c85518Srobertvoid Example(int n) {
188*12c85518Srobert  int x;          // x is ⊥
189*12c85518Srobert  if (n > 0) {
190*12c85518Srobert    if (n == 42) {
191*12c85518Srobert       x = 44;    // x is {44}
192*12c85518Srobert    } else {
193*12c85518Srobert       x = 5;     // x is {5}
194*12c85518Srobert    }
195*12c85518Srobert    print(x);     // x is {44; 5}
196*12c85518Srobert  } else {
197*12c85518Srobert    x = n;        // x is ⊤
198*12c85518Srobert  }
199*12c85518Srobert  print(x);       // x is ⊤
200*12c85518Srobert}
201*12c85518Srobert```
202*12c85518Srobert
203*12c85518SrobertAs humans, we understand the control flow from the program text. We used our
204*12c85518Srobertunderstanding of control flow to find program points where two flows join.
205*12c85518SrobertFormally, control flow is represented by a CFG (control flow graph):
206*12c85518Srobert
207*12c85518Srobert![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)
208*12c85518Srobert
209*12c85518SrobertWe can compute sets of possible values by propagating them through the CFG of
210*12c85518Srobertthe function:
211*12c85518Srobert
212*12c85518Srobert*   When `x` is declared but not initialized, its possible values are `{}`. The
213*12c85518Srobert    empty set plays the role of `⊥` in this lattice.
214*12c85518Srobert
215*12c85518Srobert*   When `x` is assigned a concrete value, its possible set of values contains
216*12c85518Srobert    just that specific value.
217*12c85518Srobert
218*12c85518Srobert*   When `x` is assigned some unknown value, it can have any value. We represent
219*12c85518Srobert    this fact as `⊤`.
220*12c85518Srobert
221*12c85518Srobert*   When two control flow paths join, we compute the set union of incoming
222*12c85518Srobert    values (limiting the number of elements to 3, representig larger sets as
223*12c85518Srobert    `⊤`).
224*12c85518Srobert
225*12c85518SrobertThe sets of possible values are influenced by:
226*12c85518Srobert
227*12c85518Srobert*   Statements, for example, assignments.
228*12c85518Srobert
229*12c85518Srobert*   Joins in control flow, for example, ones that appear at the end of "if"
230*12c85518Srobert    statements.
231*12c85518Srobert
232*12c85518Srobert**Effects of statements** are modeled by what is formally known as a transfer
233*12c85518Srobertfunction. A transfer function takes two arguments: the statement, and the state
234*12c85518Srobertof `x` at the previous program point. It produces the state of `x` at the next
235*12c85518Srobertprogram point. For example, the transfer function for assignment ignores the
236*12c85518Srobertstate at the previous program point:
237*12c85518Srobert
238*12c85518Srobert```c++
239*12c85518Srobert// GIVEN: x is {42; 44}
240*12c85518Srobertx = 0;
241*12c85518Srobert// CONCLUSION: x is {0}
242*12c85518Srobert```
243*12c85518Srobert
244*12c85518SrobertThe transfer function for `+` performs arithmetic on every set member:
245*12c85518Srobert
246*12c85518Srobert```c++
247*12c85518Srobert// GIVEN: x is {42, 44}
248*12c85518Srobertx = x + 100;
249*12c85518Srobert// CONCLUSION: x is {142, 144}
250*12c85518Srobert```
251*12c85518Srobert
252*12c85518Srobert**Effects of control flow** are modeled by joining the knowledge from all
253*12c85518Srobertpossible previous program points.
254*12c85518Srobert
255*12c85518Srobert```c++
256*12c85518Srobertif (...) {
257*12c85518Srobert  ...
258*12c85518Srobert  // GIVEN: x is {42}
259*12c85518Srobert} else {
260*12c85518Srobert  ...
261*12c85518Srobert  // GIVEN: x is {44}
262*12c85518Srobert}
263*12c85518Srobert// CONCLUSION: x is {42; 44}
264*12c85518Srobert```
265*12c85518Srobert
266*12c85518Srobert```c++
267*12c85518Srobert// GIVEN: x is {42}
268*12c85518Srobertwhile (...) {
269*12c85518Srobert  ...
270*12c85518Srobert  // GIVEN: x is {44}
271*12c85518Srobert}
272*12c85518Srobert// CONCLUSION: {42; 44}
273*12c85518Srobert```
274*12c85518Srobert
275*12c85518SrobertThe predicate that we marked "given" is usually called a precondition, and the
276*12c85518Srobertconclusion is called a postcondition.
277*12c85518Srobert
278*12c85518SrobertIn terms of the CFG, we join the information from all predecessor basic blocks.
279*12c85518Srobert
280*12c85518Srobert![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)
281*12c85518Srobert
282*12c85518SrobertPutting it all together, to model the effects of a basic block we compute:
283*12c85518Srobert
284*12c85518Srobert```
285*12c85518Srobertout = transfer(basic_block, join(in_1, in_2, ..., in_n))
286*12c85518Srobert```
287*12c85518Srobert
288*12c85518Srobert(Note that there are other ways to write this equation that produce higher
289*12c85518Srobertprecision analysis results. The trick is to keep exploring the execution paths
290*12c85518Srobertseparately and delay joining until later. However, we won't discuss those
291*12c85518Srobertvariations here.)
292*12c85518Srobert
293*12c85518SrobertTo make a conclusion about all paths through the program, we repeat this
294*12c85518Srobertcomputation on all basic blocks until we reach a fixpoint. In other words, we
295*12c85518Srobertkeep propagating information through the CFG until the computed sets of values
296*12c85518Srobertstop changing.
297*12c85518Srobert
298*12c85518SrobertIf the lattice has a finite height and transfer functions are monotonic the
299*12c85518Srobertalgorithm is guaranteed to terminate.  Each iteration of the algorithm can
300*12c85518Srobertchange computed values only to larger values from the lattice. In the worst
301*12c85518Srobertcase, all computed values become `⊤`, which is not very useful, but at least the
302*12c85518Srobertanalysis terminates at that point, because it can't change any of the values.
303*12c85518Srobert
304*12c85518SrobertFixpoint iteration can be optimised by only reprocessing basic blocks which had
305*12c85518Srobertone of their inputs changed on the previous iteration. This is typically
306*12c85518Srobertimplemented using a worklist queue. With this optimisation the time complexity
307*12c85518Srobertbecomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
308*12c85518Srobert`|L|` is the size of lattice used by the analysis.
309*12c85518Srobert
310*12c85518Srobert## Symbolic execution: a very short informal introduction
311*12c85518Srobert
312*12c85518Srobert### Symbolic values
313*12c85518Srobert
314*12c85518SrobertIn the previous example where we tried to figure out what values a variable can
315*12c85518Sroberthave, the analysis had to be seeded with a concrete value. What if there are no
316*12c85518Srobertassignments of concrete values in the program? We can still deduce some
317*12c85518Srobertinteresting information by representing unknown input values symbolically, and
318*12c85518Srobertcomputing results as symbolic expressions:
319*12c85518Srobert
320*12c85518Srobert```c++
321*12c85518Srobertvoid PrintAbs(int x) {
322*12c85518Srobert  int result;
323*12c85518Srobert  if (x >= 0) {
324*12c85518Srobert    result = x;   // result is {x}
325*12c85518Srobert  } else {
326*12c85518Srobert    result = -x;  // result is {-x}
327*12c85518Srobert  }
328*12c85518Srobert  print(result);  // result is {x; -x}
329*12c85518Srobert}
330*12c85518Srobert```
331*12c85518Srobert
332*12c85518SrobertWe can't say what specific value gets printed, but we know that it is either `x`
333*12c85518Srobertor `-x`.
334*12c85518Srobert
335*12c85518SrobertDataflow analysis is an istance of abstract interpretation, and does not dictate
336*12c85518Sroberthow exactly the lattice and transfer functions should be designed, beyond the
337*12c85518Srobertnecessary conditions for the analysis to converge. Nevertheless, we can use
338*12c85518Srobertsymbolic execution ideas to guide our design of the lattice and transfer
339*12c85518Srobertfunctions: lattice values can be symbolic expressions, and transfer functions
340*12c85518Srobertcan construct more complex symbolic expressions from symbolic expressions that
341*12c85518Srobertrepresent arguments. See [this StackOverflow
342*12c85518Srobertdiscussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
343*12c85518Srobertfor a further comparison of abstract interpretation and symbolic execution.
344*12c85518Srobert
345*12c85518Srobert### Flow condition
346*12c85518Srobert
347*12c85518SrobertA human can say about the previous example that the function returns `x` when
348*12c85518Srobert`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
349*12c85518Sroberttracking a flow condition. A flow condition is a predicate written in terms of
350*12c85518Srobertthe program state that is true at a specific program point regardless of the
351*12c85518Srobertexecution path that led to this statement. For example, the flow condition for
352*12c85518Srobertthe program point right before evaluating `result = x` is `x >= 0`.
353*12c85518Srobert
354*12c85518SrobertIf we enhance the lattice to be a set of pairs of values and predicates, the
355*12c85518Srobertdataflow analysis computes the following values:
356*12c85518Srobert
357*12c85518Srobert```c++
358*12c85518Srobertvoid PrintAbs(int x) {
359*12c85518Srobert  int result;
360*12c85518Srobert  if (x >= 0) {
361*12c85518Srobert    // Flow condition: x >= 0.
362*12c85518Srobert    result = x;   // result is {x if x >= 0}
363*12c85518Srobert  } else {
364*12c85518Srobert    // Flow condition: x < 0.
365*12c85518Srobert    result = -x;  // result is {-x if x < 0}
366*12c85518Srobert  }
367*12c85518Srobert  print(result);  // result is {x if x >= 0; -x if x < 0}
368*12c85518Srobert}
369*12c85518Srobert```
370*12c85518Srobert
371*12c85518SrobertOf course, in a program with loops, symbolic expressions for flow conditions can
372*12c85518Srobertgrow unbounded. A practical static analysis system must control this growth to
373*12c85518Srobertkeep the symbolic representations manageable and ensure that the data flow
374*12c85518Srobertanalysis terminates. For example, it can use a constraint solver to prune
375*12c85518Srobertimpossible flow conditions, and/or it can abstract them, losing precision, after
376*12c85518Sroberttheir symbolic representations grow beyond some threshold. This is similar to
377*12c85518Sroberthow we had to limit the sizes of computed sets of possible values to 3 elements.
378*12c85518Srobert
379*12c85518Srobert### Symbolic pointers
380*12c85518Srobert
381*12c85518SrobertThis approach proves to be particularly useful for modeling pointer values,
382*12c85518Srobertsince we don't care about specific addresses but just want to give a unique
383*12c85518Srobertidentifier to a memory location.
384*12c85518Srobert
385*12c85518Srobert```c++
386*12c85518Srobertvoid ExampleOfSymbolicPointers(bool b) {
387*12c85518Srobert  int x = 0;     // x is {0}
388*12c85518Srobert  int* ptr = &x; // x is {0}      ptr is {&x}
389*12c85518Srobert  if (b) {
390*12c85518Srobert    *ptr = 42;   // x is {42}     ptr is {&x}
391*12c85518Srobert  }
392*12c85518Srobert  print(x);      // x is {0; 42}  ptr is {&x}
393*12c85518Srobert}
394*12c85518Srobert```
395*12c85518Srobert
396*12c85518Srobert## Example: finding output parameters
397*12c85518Srobert
398*12c85518SrobertLet's explore how data flow analysis can help with a problem that is hard to
399*12c85518Srobertsolve with other tools in Clang.
400*12c85518Srobert
401*12c85518Srobert### Problem description
402*12c85518Srobert
403*12c85518SrobertOutput parameters are function parameters of pointer or reference type whose
404*12c85518Srobertpointee is completely overwritten by the function, and not read before it is
405*12c85518Srobertoverwritten. They are common in pre-C++11 code due to the absence of move
406*12c85518Srobertsemantics. In modern C++ output parameters are non-idiomatic, and return values
407*12c85518Srobertare used instead.
408*12c85518Srobert
409*12c85518SrobertImagine that we would like to refactor output parameters to return values to
410*12c85518Srobertmodernize old code. The first step is to identify refactoring candidates through
411*12c85518Srobertstatic analysis.
412*12c85518Srobert
413*12c85518SrobertFor example, in the following code snippet the pointer `c` is an output
414*12c85518Srobertparameter:
415*12c85518Srobert
416*12c85518Srobert```c++
417*12c85518Srobertstruct Customer {
418*12c85518Srobert  int account_id;
419*12c85518Srobert  std::string name;
420*12c85518Srobert}
421*12c85518Srobert
422*12c85518Srobertvoid GetCustomer(Customer *c) {
423*12c85518Srobert  c->account_id = ...;
424*12c85518Srobert  if (...) {
425*12c85518Srobert    c->name = ...;
426*12c85518Srobert  } else {
427*12c85518Srobert    c->name = ...;
428*12c85518Srobert  }
429*12c85518Srobert}
430*12c85518Srobert```
431*12c85518Srobert
432*12c85518SrobertWe would like to refactor this code into:
433*12c85518Srobert
434*12c85518Srobert```c++
435*12c85518SrobertCustomer GetCustomer() {
436*12c85518Srobert  Customer c;
437*12c85518Srobert  c.account_id = ...;
438*12c85518Srobert  if (...) {
439*12c85518Srobert    c.name = ...;
440*12c85518Srobert  } else {
441*12c85518Srobert    c.name = ...;
442*12c85518Srobert  }
443*12c85518Srobert  return c;
444*12c85518Srobert}
445*12c85518Srobert```
446*12c85518Srobert
447*12c85518SrobertHowever, in the function below the parameter `c` is not an output parameter
448*12c85518Srobertbecause its field `name` is not overwritten on every path through the function.
449*12c85518Srobert
450*12c85518Srobert```c++
451*12c85518Srobertvoid GetCustomer(Customer *c) {
452*12c85518Srobert  c->account_id = ...;
453*12c85518Srobert  if (...) {
454*12c85518Srobert    c->name = ...;
455*12c85518Srobert  }
456*12c85518Srobert}
457*12c85518Srobert```
458*12c85518Srobert
459*12c85518SrobertThe code also cannot read the value of the parameter before overwriting it:
460*12c85518Srobert
461*12c85518Srobert```c++
462*12c85518Srobertvoid GetCustomer(Customer *c) {
463*12c85518Srobert  use(c->account_id);
464*12c85518Srobert  c->name = ...;
465*12c85518Srobert  c->account_id = ...;
466*12c85518Srobert}
467*12c85518Srobert```
468*12c85518Srobert
469*12c85518SrobertFunctions that escape the pointer also block the refactoring:
470*12c85518Srobert
471*12c85518Srobert```c++
472*12c85518SrobertCustomer* kGlobalCustomer;
473*12c85518Srobert
474*12c85518Srobertvoid GetCustomer(Customer *c) {
475*12c85518Srobert  c->name = ...;
476*12c85518Srobert  c->account_id = ...;
477*12c85518Srobert  kGlobalCustomer = c;
478*12c85518Srobert}
479*12c85518Srobert```
480*12c85518Srobert
481*12c85518SrobertTo identify a candidate function for refactoring, we need to do the following:
482*12c85518Srobert
483*12c85518Srobert*   Find a function with a non-const pointer or reference parameter.
484*12c85518Srobert
485*12c85518Srobert*   Find the definition of that function.
486*12c85518Srobert
487*12c85518Srobert*   Prove that the function completely overwrites the pointee on all paths
488*12c85518Srobert    before returning.
489*12c85518Srobert
490*12c85518Srobert*   Prove that the function reads the pointee only after overwriting it.
491*12c85518Srobert
492*12c85518Srobert*   Prove that the function does not persist the pointer in a data structure
493*12c85518Srobert    that is live after the function returns.
494*12c85518Srobert
495*12c85518SrobertThere are also requirements that all usage sites of the candidate function must
496*12c85518Srobertsatisfy, for example, that function arguments do not alias, that users are not
497*12c85518Sroberttaking the address of the function, and so on. Let's consider verifying usage
498*12c85518Srobertsite conditions to be a separate static analysis problem.
499*12c85518Srobert
500*12c85518Srobert### Lattice design
501*12c85518Srobert
502*12c85518SrobertTo analyze the function body we can use a lattice which consists of normal
503*12c85518Srobertstates and failure states. A normal state describes program points where we are
504*12c85518Srobertsure that no behaviors that block the refactoring have occurred. Normal states
505*12c85518Srobertkeep track of all parameter's member fields that are known to be overwritten on
506*12c85518Srobertevery path from function entry to the corresponding program point. Failure
507*12c85518Srobertstates accumulate observed violations (unsafe reads and pointer escapes) that
508*12c85518Srobertblock the refactoring.
509*12c85518Srobert
510*12c85518SrobertIn the partial order of the lattice failure states compare greater than normal
511*12c85518Srobertstates, which guarantees that they "win" when joined with normal states. Order
512*12c85518Srobertbetween failure states is determined by inclusion relation on the set of
513*12c85518Srobertaccumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order
514*12c85518Srobertbetween normal states is determined by reversed inclusion relation on the set of
515*12c85518Srobertoverwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of
516*12c85518Srobertoverwritten fields).
517*12c85518Srobert
518*12c85518Srobert![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)
519*12c85518Srobert
520*12c85518SrobertTo determine whether a statement reads or writes a field we can implement
521*12c85518Srobertsymbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer
522*12c85518Srobertdereference operator and `MemberExpr`s.
523*12c85518Srobert
524*12c85518Srobert### Using data flow results to identify output parameters
525*12c85518Srobert
526*12c85518SrobertLet's take a look at how we use data flow analysis to identify an output
527*12c85518Srobertparameter. The refactoring can be safely done when the data flow algorithm
528*12c85518Srobertcomputes a normal state with all of the fields proven to be overwritten in the
529*12c85518Srobertexit basic block of the function.
530*12c85518Srobert
531*12c85518Srobert```c++
532*12c85518Srobertstruct Customer {
533*12c85518Srobert  int account_id;
534*12c85518Srobert  std::string name;
535*12c85518Srobert};
536*12c85518Srobert
537*12c85518Srobertvoid GetCustomer(Customer* c) {
538*12c85518Srobert  // Overwritten: {}
539*12c85518Srobert  c->account_id = ...; // Overwritten: {c->account_id}
540*12c85518Srobert  if (...) {
541*12c85518Srobert    c->name = ...;     // Overwritten: {c->account_id, c->name}
542*12c85518Srobert  } else {
543*12c85518Srobert    c->name = ...;     // Overwritten: {c->account_id, c->name}
544*12c85518Srobert  }
545*12c85518Srobert  // Overwritten: {c->account_id, c->name}
546*12c85518Srobert}
547*12c85518Srobert```
548*12c85518Srobert
549*12c85518SrobertWhen the data flow algorithm computes a normal state, but not all fields are
550*12c85518Srobertproven to be overwritten we can't perform the refactoring.
551*12c85518Srobert
552*12c85518Srobert```c++
553*12c85518Srobertvoid target(bool b, Customer* c) {
554*12c85518Srobert  // Overwritten: {}
555*12c85518Srobert  if (b) {
556*12c85518Srobert    c->account_id = 42;     // Overwritten: {c->account_id}
557*12c85518Srobert  } else {
558*12c85518Srobert    c->name = "Konrad";  // Overwritten: {c->name}
559*12c85518Srobert  }
560*12c85518Srobert  // Overwritten: {}
561*12c85518Srobert}
562*12c85518Srobert```
563*12c85518Srobert
564*12c85518SrobertSimilarly, when the data flow algorithm computes a failure state, we also can't
565*12c85518Srobertperform the refactoring.
566*12c85518Srobert
567*12c85518Srobert```c++
568*12c85518SrobertCustomer* kGlobalCustomer;
569*12c85518Srobert
570*12c85518Srobertvoid GetCustomer(Customer* c) {
571*12c85518Srobert  // Overwritten: {}
572*12c85518Srobert  c->account_id = ...;    // Overwritten: {c->account_id}
573*12c85518Srobert  if (...) {
574*12c85518Srobert    print(c->name);       // Unsafe read
575*12c85518Srobert  } else {
576*12c85518Srobert    kGlobalCustomer = c;  // Pointer escape
577*12c85518Srobert  }
578*12c85518Srobert  // Unsafe read, Pointer escape
579*12c85518Srobert}
580*12c85518Srobert```
581*12c85518Srobert
582*12c85518Srobert## Example: finding dead stores
583*12c85518Srobert
584*12c85518SrobertLet's say we want to find redundant stores, because they indicate potential
585*12c85518Srobertbugs.
586*12c85518Srobert
587*12c85518Srobert```c++
588*12c85518Srobertx = GetX();
589*12c85518Srobertx = GetY();
590*12c85518Srobert```
591*12c85518Srobert
592*12c85518SrobertThe first store to `x` is never read, probably there is a bug.
593*12c85518Srobert
594*12c85518SrobertThe implementation of dead store analysis is very similar to output parameter
595*12c85518Srobertanalysis: we need to track stores and loads, and find stores that were never
596*12c85518Srobertread.
597*12c85518Srobert
598*12c85518Srobert[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
599*12c85518Srobertgeneralization of this idea, which is often used to answer many related
600*12c85518Srobertquestions, for example:
601*12c85518Srobert
602*12c85518Srobert* finding dead stores,
603*12c85518Srobert* finding uninitialized variables,
604*12c85518Srobert* finding a good point to deallocate memory,
605*12c85518Srobert* finding out if it would be safe to move an object.
606*12c85518Srobert
607*12c85518Srobert## Example: definitive initialization
608*12c85518Srobert
609*12c85518SrobertDefinitive initialization proves that variables are known to be initialized when
610*12c85518Srobertread. If we find a variable which is read when not initialized then we generate
611*12c85518Sroberta warning.
612*12c85518Srobert
613*12c85518Srobert```c++
614*12c85518Srobertvoid Init() {
615*12c85518Srobert  int x;    // x is uninitialized
616*12c85518Srobert  if (cond()) {
617*12c85518Srobert    x = 10; // x is initialized
618*12c85518Srobert  } else {
619*12c85518Srobert    x = 20; // x is initialized
620*12c85518Srobert  }
621*12c85518Srobert  print(x); // x is initialized
622*12c85518Srobert}
623*12c85518Srobert```
624*12c85518Srobert
625*12c85518Srobert```c++
626*12c85518Srobertvoid Uninit() {
627*12c85518Srobert  int x;    // x is uninitialized
628*12c85518Srobert  if (cond()) {
629*12c85518Srobert    x = 10; // x is initialized
630*12c85518Srobert  }
631*12c85518Srobert  print(x); // x is maybe uninitialized, x is being read, report a bug.
632*12c85518Srobert}
633*12c85518Srobert```
634*12c85518Srobert
635*12c85518SrobertFor this purpose we can use lattice in a form of a mapping from variable
636*12c85518Srobertdeclarations to initialization states; each initialization state is represented
637*12c85518Srobertby the followingn lattice:
638*12c85518Srobert
639*12c85518Srobert![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)
640*12c85518Srobert
641*12c85518SrobertA lattice element could also capture the source locations of the branches that
642*12c85518Srobertlead us to the corresponding program point. Diagnostics would use this
643*12c85518Srobertinformation to show a sample buggy code path to the user.
644*12c85518Srobert
645*12c85518Srobert## Example: refactoring raw pointers to `unique_ptr`
646*12c85518Srobert
647*12c85518SrobertModern idiomatic C++ uses smart pointers to express memory ownership, however in
648*12c85518Srobertpre-C++11 code one can often find raw pointers that own heap memory blocks.
649*12c85518Srobert
650*12c85518SrobertImagine that we would like to refactor raw pointers that own memory to
651*12c85518Srobert`unique_ptr`. There are multiple ways to design a data flow analysis for this
652*12c85518Srobertproblem; let's look at one way to do it.
653*12c85518Srobert
654*12c85518SrobertFor example, we would like to refactor the following code that uses raw
655*12c85518Srobertpointers:
656*12c85518Srobert
657*12c85518Srobert```c++
658*12c85518Srobertvoid UniqueOwnership1() {
659*12c85518Srobert  int *pi = new int;
660*12c85518Srobert  if (...) {
661*12c85518Srobert    Borrow(pi);
662*12c85518Srobert    delete pi;
663*12c85518Srobert  } else {
664*12c85518Srobert    TakeOwnership(pi);
665*12c85518Srobert  }
666*12c85518Srobert}
667*12c85518Srobert```
668*12c85518Srobert
669*12c85518Srobertinto code that uses `unique_ptr`:
670*12c85518Srobert
671*12c85518Srobert```c++
672*12c85518Srobertvoid UniqueOwnership1() {
673*12c85518Srobert  auto pi = std::make_unique<int>();
674*12c85518Srobert  if (...) {
675*12c85518Srobert    Borrow(pi.get());
676*12c85518Srobert  } else {
677*12c85518Srobert    TakeOwnership(pi.release());
678*12c85518Srobert  }
679*12c85518Srobert}
680*12c85518Srobert```
681*12c85518Srobert
682*12c85518SrobertThis problem can be solved with a lattice in form of map from value declarations
683*12c85518Srobertto pointer states:
684*12c85518Srobert
685*12c85518Srobert![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)
686*12c85518Srobert
687*12c85518SrobertWe can perform the refactoring if at the exit of a function `pi` is
688*12c85518Srobert`Compatible`.
689*12c85518Srobert
690*12c85518Srobert```c++
691*12c85518Srobertvoid UniqueOwnership1() {
692*12c85518Srobert  int *pi;             // pi is Compatible
693*12c85518Srobert  pi = new int;        // pi is Defined
694*12c85518Srobert  if (...) {
695*12c85518Srobert    Borrow(pi);        // pi is Defined
696*12c85518Srobert    delete pi;         // pi is Compatible
697*12c85518Srobert  } else {
698*12c85518Srobert    TakeOwnership(pi); // pi is Compatible
699*12c85518Srobert  }
700*12c85518Srobert  // pi is Compatible
701*12c85518Srobert}
702*12c85518Srobert```
703*12c85518Srobert
704*12c85518SrobertLet's look at an example where the raw pointer owns two different memory blocks:
705*12c85518Srobert
706*12c85518Srobert```c++
707*12c85518Srobertvoid UniqueOwnership2() {
708*12c85518Srobert  int *pi = new int;  // pi is Defined
709*12c85518Srobert  Borrow(pi);
710*12c85518Srobert  delete pi;          // pi is Compatible
711*12c85518Srobert  if (smth) {
712*12c85518Srobert    pi = new int;     // pi is Defined
713*12c85518Srobert    Borrow(pi);
714*12c85518Srobert    delete pi;        // pi is Compatible
715*12c85518Srobert  }
716*12c85518Srobert  // pi is Compatible
717*12c85518Srobert}
718*12c85518Srobert```
719*12c85518Srobert
720*12c85518SrobertIt can be refactored to use `unique_ptr` like this:
721*12c85518Srobert
722*12c85518Srobert```c++
723*12c85518Srobertvoid UniqueOwnership2() {
724*12c85518Srobert  auto pi = make_unique<int>();
725*12c85518Srobert  Borrow(pi);
726*12c85518Srobert  if (smth) {
727*12c85518Srobert    pi = make_unique<int>();
728*12c85518Srobert    Borrow(pi);
729*12c85518Srobert  }
730*12c85518Srobert}
731*12c85518Srobert```
732*12c85518Srobert
733*12c85518SrobertIn the following example, the raw pointer is used to access the heap object
734*12c85518Srobertafter the ownership has been transferred.
735*12c85518Srobert
736*12c85518Srobert```c++
737*12c85518Srobertvoid UniqueOwnership3() {
738*12c85518Srobert  int *pi = new int; // pi is Defined
739*12c85518Srobert  if (...) {
740*12c85518Srobert    Borrow(pi);
741*12c85518Srobert    delete pi;       // pi is Compatible
742*12c85518Srobert  } else {
743*12c85518Srobert    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
744*12c85518Srobert    print(*pi);
745*12c85518Srobert    use(v);
746*12c85518Srobert  }
747*12c85518Srobert  // pi is Compatible
748*12c85518Srobert}
749*12c85518Srobert```
750*12c85518Srobert
751*12c85518SrobertWe can refactor this code to use `unique_ptr`, however we would have to
752*12c85518Srobertintroduce a non-owning pointer variable, since we can't use the moved-from
753*12c85518Srobert`unique_ptr` to access the object:
754*12c85518Srobert
755*12c85518Srobert```c++
756*12c85518Srobertvoid UniqueOwnership3() {
757*12c85518Srobert  std::unique_ptr<int> pi = std::make_unique<int>();
758*12c85518Srobert  if (...) {
759*12c85518Srobert    Borrow(pi);
760*12c85518Srobert  } else {
761*12c85518Srobert    int *pi_non_owning = pi.get();
762*12c85518Srobert    vector<unique_ptr<int>> v = {std::move(pi)};
763*12c85518Srobert    print(*pi_non_owning);
764*12c85518Srobert    use(v);
765*12c85518Srobert  }
766*12c85518Srobert}
767*12c85518Srobert```
768*12c85518Srobert
769*12c85518SrobertIf the original code didn't call `delete` at the very end of the function, then
770*12c85518Srobertour refactoring may change the point at which we run the destructor and release
771*12c85518Srobertmemory. Specifically, if there is some user code after `delete`, then extending
772*12c85518Srobertthe lifetime of the object until the end of the function may hold locks for
773*12c85518Srobertlonger than necessary, introduce memory overhead etc.
774*12c85518Srobert
775*12c85518SrobertOne solution is to always replace `delete` with a call to `reset()`, and then
776*12c85518Srobertperform another analysis that removes unnecessary `reset()` calls.
777*12c85518Srobert
778*12c85518Srobert```c++
779*12c85518Srobertvoid AddedMemoryOverhead() {
780*12c85518Srobert  HugeObject *ho = new HugeObject();
781*12c85518Srobert  use(ho);
782*12c85518Srobert  delete ho; // Release the large amount of memory quickly.
783*12c85518Srobert  LongRunningFunction();
784*12c85518Srobert}
785*12c85518Srobert```
786*12c85518Srobert
787*12c85518SrobertThis analysis will refuse to refactor code that mixes borrowed pointer values
788*12c85518Srobertand unique ownership. In the following code, `GetPtr()` returns a borrowed
789*12c85518Srobertpointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
790*12c85518Srobertpointer. We don't distinguish between these two assignments, and we want each
791*12c85518Srobertassignment to be paired with a corresponding sink; otherwise, we transition the
792*12c85518Srobertpointer to a `Conflicting` state, like in this example.
793*12c85518Srobert
794*12c85518Srobert```c++
795*12c85518Srobertvoid ConflictingOwnership() {
796*12c85518Srobert  int *pi;           // pi is Compatible
797*12c85518Srobert  pi = GetPtr();     // pi is Defined
798*12c85518Srobert  Borrow(pi);        // pi is Defined
799*12c85518Srobert
800*12c85518Srobert  pi = new int;      // pi is Conflicting
801*12c85518Srobert  Borrow(pi);
802*12c85518Srobert  delete pi;
803*12c85518Srobert  // pi is Conflicting
804*12c85518Srobert}
805*12c85518Srobert```
806*12c85518Srobert
807*12c85518SrobertWe could still handle this case by finding a maximal range in the code where
808*12c85518Srobert`pi` could be in the Compatible state, and only refactoring that part.
809*12c85518Srobert
810*12c85518Srobert```c++
811*12c85518Srobertvoid ConflictingOwnership() {
812*12c85518Srobert  int *pi;
813*12c85518Srobert  pi = GetPtr();
814*12c85518Srobert  Borrow(pi);
815*12c85518Srobert
816*12c85518Srobert  std::unique_ptr<int> pi_unique = std::make_unique<int>();
817*12c85518Srobert  Borrow(pi_unique.get());
818*12c85518Srobert}
819*12c85518Srobert```
820*12c85518Srobert
821*12c85518Srobert## Example: finding redundant branch conditions
822*12c85518Srobert
823*12c85518SrobertIn the code below `b1` should not be checked in both the outer and inner "if"
824*12c85518Srobertstatements. It is likely there is a bug in this code.
825*12c85518Srobert
826*12c85518Srobert```c++
827*12c85518Srobertint F(bool b1, bool b2) {
828*12c85518Srobert  if (b1) {
829*12c85518Srobert    f();
830*12c85518Srobert    if (b1 && b2) {  // Check `b1` again -- unnecessary!
831*12c85518Srobert      g();
832*12c85518Srobert    }
833*12c85518Srobert  }
834*12c85518Srobert}
835*12c85518Srobert```
836*12c85518Srobert
837*12c85518SrobertA checker that finds this pattern syntactically is already implemented in
838*12c85518SrobertClangTidy using AST matchers (`bugprone-redundant-branch-condition`).
839*12c85518Srobert
840*12c85518SrobertTo implement it using the data flow analysis framework, we can produce a warning
841*12c85518Srobertif any part of the branch condition is implied by the flow condition.
842*12c85518Srobert
843*12c85518Srobert```c++
844*12c85518Srobertint F(bool b1, bool b2) {
845*12c85518Srobert  // Flow condition: true.
846*12c85518Srobert  if (b1) {
847*12c85518Srobert    // Flow condition: b1.
848*12c85518Srobert    f();
849*12c85518Srobert    if (b1 && b2) { // `b1` is implied by the flow condition.
850*12c85518Srobert      g();
851*12c85518Srobert    }
852*12c85518Srobert  }
853*12c85518Srobert}
854*12c85518Srobert```
855*12c85518Srobert
856*12c85518SrobertOne way to check this implication is to use a SAT solver. Without a SAT solver,
857*12c85518Srobertwe could keep the flow condition in the CNF form and then it would be easy to
858*12c85518Srobertcheck the implication.
859*12c85518Srobert
860*12c85518Srobert## Example: finding unchecked `std::optional` unwraps
861*12c85518Srobert
862*12c85518SrobertCalling `optional::value()` is only valid if `optional::has_value()` is true. We
863*12c85518Srobertwant to show that when `x.value()` is executed, the flow condition implies
864*12c85518Srobert`x.has_value()`.
865*12c85518Srobert
866*12c85518SrobertIn the example below `x.value()` is accessed safely because it is guarded by the
867*12c85518Srobert`x.has_value()` check.
868*12c85518Srobert
869*12c85518Srobert```c++
870*12c85518Srobertvoid Example(std::optional<int> &x) {
871*12c85518Srobert  if (x.has_value()) {
872*12c85518Srobert    use(x.value());
873*12c85518Srobert  }
874*12c85518Srobert}
875*12c85518Srobert```
876*12c85518Srobert
877*12c85518SrobertWhile entering the if branch we deduce that `x.has_value()` is implied by the
878*12c85518Srobertflow condition.
879*12c85518Srobert
880*12c85518Srobert```c++
881*12c85518Srobertvoid Example(std::optional<int> x) {
882*12c85518Srobert  // Flow condition: true.
883*12c85518Srobert  if (x.has_value()) {
884*12c85518Srobert    // Flow condition: x.has_value() == true.
885*12c85518Srobert    use(x.value());
886*12c85518Srobert  }
887*12c85518Srobert  // Flow condition: true.
888*12c85518Srobert}
889*12c85518Srobert```
890*12c85518Srobert
891*12c85518SrobertWe also need to prove that `x` is not modified between check and value access.
892*12c85518SrobertThe modification of `x` may be very subtle:
893*12c85518Srobert
894*12c85518Srobert```c++
895*12c85518Srobertvoid F(std::optional<int> &x);
896*12c85518Srobert
897*12c85518Srobertvoid Example(std::optional<int> &x) {
898*12c85518Srobert  if (x.has_value()) {
899*12c85518Srobert    // Flow condition: x.has_value() == true.
900*12c85518Srobert    unknown_function(x); // may change x.
901*12c85518Srobert    // Flow condition: true.
902*12c85518Srobert    use(x.value());
903*12c85518Srobert  }
904*12c85518Srobert}
905*12c85518Srobert```
906*12c85518Srobert
907*12c85518Srobert## Example: finding dead code behind A/B experiment flags
908*12c85518Srobert
909*12c85518SrobertFinding dead code is a classic application of data flow analysis.
910*12c85518Srobert
911*12c85518SrobertUnused flags for A/B experiment hide dead code. However, this flavor of dead
912*12c85518Srobertcode is invisible to the compiler because the flag can be turned on at any
913*12c85518Srobertmoment.
914*12c85518Srobert
915*12c85518SrobertWe could make a tool that deletes experiment flags. The user tells us which flag
916*12c85518Srobertthey want to delete, and we assume that the it's value is a given constant.
917*12c85518Srobert
918*12c85518SrobertFor example, the user could use the tool to remove `example_flag` from this
919*12c85518Srobertcode:
920*12c85518Srobert
921*12c85518Srobert```c++
922*12c85518SrobertDEFINE_FLAG(std::string, example_flag, "", "A sample flag.");
923*12c85518Srobert
924*12c85518Srobertvoid Example() {
925*12c85518Srobert  bool x = GetFlag(FLAGS_example_flag).empty();
926*12c85518Srobert  f();
927*12c85518Srobert  if (x) {
928*12c85518Srobert    g();
929*12c85518Srobert  } else {
930*12c85518Srobert    h();
931*12c85518Srobert  }
932*12c85518Srobert}
933*12c85518Srobert```
934*12c85518Srobert
935*12c85518SrobertThe tool would simplify the code to:
936*12c85518Srobert
937*12c85518Srobert```c++
938*12c85518Srobertvoid Example() {
939*12c85518Srobert  f();
940*12c85518Srobert  g();
941*12c85518Srobert}
942*12c85518Srobert```
943*12c85518Srobert
944*12c85518SrobertWe can solve this problem with a classic constant propagation lattice combined
945*12c85518Srobertwith symbolic evaluation.
946*12c85518Srobert
947*12c85518Srobert## Example: finding inefficient usages of associative containers
948*12c85518Srobert
949*12c85518SrobertReal-world code often accidentally performs repeated lookups in associative
950*12c85518Srobertcontainers:
951*12c85518Srobert
952*12c85518Srobert```c++
953*12c85518Srobertmap<int, Employee> xs;
954*12c85518Srobertxs[42]->name = "...";
955*12c85518Srobertxs[42]->title = "...";
956*12c85518Srobert```
957*12c85518Srobert
958*12c85518SrobertTo find the above inefficiency we can use the available expressions analysis to
959*12c85518Srobertunderstand that `m[42]` is evaluated twice.
960*12c85518Srobert
961*12c85518Srobert```c++
962*12c85518Srobertmap<int, Employee> xs;
963*12c85518SrobertEmployee &e = xs[42];
964*12c85518Sroberte->name = "...";
965*12c85518Sroberte->title = "...";
966*12c85518Srobert```
967*12c85518Srobert
968*12c85518SrobertWe can also track the `m.contains()` check in the flow condition to find
969*12c85518Srobertredundant checks, like in the example below.
970*12c85518Srobert
971*12c85518Srobert```c++
972*12c85518Srobertstd::map<int, Employee> xs;
973*12c85518Srobertif (!xs.contains(42)) {
974*12c85518Srobert  xs.insert({42, someEmployee});
975*12c85518Srobert}
976*12c85518Srobert```
977*12c85518Srobert
978*12c85518Srobert## Example: refactoring types that implicitly convert to each other
979*12c85518Srobert
980*12c85518SrobertRefactoring one strong type to another is difficult, but the compiler can help:
981*12c85518Srobertonce you refactor one reference to the type, the compiler will flag other places
982*12c85518Srobertwhere this information flows with type mismatch errors. Unfortunately this
983*12c85518Srobertstrategy does not work when you are refactoring types that implicitly convert to
984*12c85518Sroberteach other, for example, replacing `int32_t` with `int64_t`.
985*12c85518Srobert
986*12c85518SrobertImagine that we want to change user IDs from 32 to 64-bit integers. In other
987*12c85518Srobertwords, we need to find all integers tainted with user IDs. We can use data flow
988*12c85518Srobertanalysis to implement taint analysis.
989*12c85518Srobert
990*12c85518Srobert```c++
991*12c85518Srobertvoid UseUser(int32_t user_id) {
992*12c85518Srobert  int32_t id = user_id;
993*12c85518Srobert  // Variable `id` is tainted with a user ID.
994*12c85518Srobert  ...
995*12c85518Srobert}
996*12c85518Srobert```
997*12c85518Srobert
998*12c85518SrobertTaint analysis is very well suited to this problem because the program rarely
999*12c85518Srobertbranches on user IDs, and almost certainly does not perform any computation
1000*12c85518Srobert(like arithmetic).
1001