xref: /llvm-project/clang/docs/DataFlowAnalysisIntro.md (revision cc8237d9d727b864fe65d2c9d620f3332f4235ca)
1# Data flow analysis: an informal introduction
2
3## Abstract
4
5This document introduces data flow analysis in an informal way. The goal is to
6give the reader an intuitive understanding of how it works, and show how it
7applies to a range of refactoring and bug finding problems.
8
9Data flow analysis is a well-established technique; it is described in many
10papers, books, and videos. If you would like a more formal, or a more thorough
11explanation of the concepts mentioned in this document, please refer to the
12following resources:
13
14*   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
15*   Videos on the PacketPrep YouTube channel that introduce lattices and the
16    necessary background information:
17    [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
18    [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
19    [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
20    [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
21    [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
22    [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
23*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
24*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
25*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
26*   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
27*   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
28    (a paper that successfully applies symbolic execution to real-world
29    software).
30
31## Data flow analysis
32
33### The purpose of data flow analysis
34
35Data flow analysis is a static analysis technique that proves facts about a
36program or its fragment. It can make conclusions about all paths through the
37program, while taking control flow into account and scaling to large programs.
38The basic idea is propagating facts about the program through the edges of the
39control flow graph (CFG) until a fixpoint is reached.
40
41### Sample problem and an ad-hoc solution
42
43We would like to explain data flow analysis while discussing an example. Let's
44imagine that we want to track possible values of an integer variable in our
45program. Here is how a human could annotate the code:
46
47```c++
48void Example(int n) {
49  int x = 0;
50  // x is {0}
51  if (n > 0) {
52    x = 5;
53    // x is {5}
54  } else {
55    x = 42;
56    // x is {42}
57  }
58  // x is {5; 42}
59  print(x);
60}
61```
62
63We use sets of integers to represent possible values of `x`. Local variables
64have unambiguous values between statements, so we annotate program points
65between statements with sets of possible values.
66
67Here is how we arrived at these annotations. Assigning a constant to `x` allows
68us to make a conclusion that `x` can only have one value. When control flow from
69the "then" and "else" branches joins, `x` can have either value.
70
71Abstract algebra provides a nice formalism that models this kind of structure,
72namely, a lattice. A join-semilattice is a partially ordered set, in which every
73two elements have a least upper bound (called a *join*).
74
75```
76join(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x
77```
78
79For this problem we will use the lattice of subsets of integers, with set
80inclusion relation as ordering and set union as a join.
81
82Lattices are often represented visually as Hasse diagrams. Here is a Hasse
83diagram for our lattice that tracks subsets of integers:
84
85![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)
86
87Computing the join in the lattice corresponds to finding the lowest common
88ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
89literature on efficiently implementing LCA queries for a DAG, however Efficient
90Implementation of Lattice Operations (1989)
91([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
92[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that
93particularly well-suited for programmatic implementation.
94
95### Too much information and "top" values
96
97Let's try to find the possible sets of values of `x` in a function that modifies
98`x` in a loop:
99
100```c++
101void ExampleOfInfiniteSets() {
102  int x = 0; // x is {0}
103  while (condition()) {
104    x += 1;  // x is {0; 1; 2; …}
105  }
106  print(x);  // x is {0; 1; 2; …}
107}
108```
109
110We have an issue: `x` can have any value greater than zero; that's an infinite
111set of values, if the program operated on mathematical integers. In C++ `int` is
112limited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is
113still really big.
114
115To make our analysis practical to compute, we have to limit the amount of
116information that we track. In this case, we can, for example, arbitrarily limit
117the size of sets to 3 elements. If at a certain program point `x` has more than
1183 possible values, we stop tracking specific values at that program point.
119Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
120according to a convention in abstract algebra).
121
122```c++
123void ExampleOfTopWithALoop() {
124  int x = 0;  // x is {0}
125  while (condition()) {
126    x += 1;   // x is ⊤
127  }
128  print(x);   // x is ⊤
129}
130```
131
132The statement "at this program point, `x`'s possible values are `⊤`" is
133understood as "at this program point `x` can have any value because we have too
134much information, or the information is conflicting".
135
136Note that we can get more than 3 possible values even without a loop:
137
138```c++
139void ExampleOfTopWithoutLoops(int n) {
140  int x = 0;  // x is {0}
141  switch(n) {
142    case 0:  x = 1; break; // x is {1}
143    case 1:  x = 9; break; // x is {9}
144    case 2:  x = 7; break; // x is {7}
145    default: x = 3; break; // x is {3}
146  }
147  // x is ⊤
148}
149```
150
151### Uninitialized variables and "bottom" values
152
153When `x` is declared but not initialized, it has no possible values. We
154represent this fact symbolically as `⊥` (pronounced "bottom").
155
156```c++
157void ExampleOfBottom() {
158  int x;    // x is ⊥
159  x = 42;   // x is {42}
160  print(x);
161}
162```
163
164Note that using values read from uninitialized variables is undefined behaviour
165in C++. Generally, compilers and static analysis tools can assume undefined
166behavior does not happen. We must model uninitialized variables only when we are
167implementing a checker that specifically is trying to find uninitialized reads.
168In this example we show how to model uninitialized variables only to demonstrate
169the concept of "bottom", and how it applies to possible value analysis. We
170describe an analysis that finds uninitialized reads in a section below.
171
172### A practical lattice that tracks sets of concrete values
173
174Taking into account all corner cases covered above, we can put together a
175lattice that we can use in practice to track possible values of integer
176variables. This lattice represents sets of integers with 1, 2, or 3 elements, as
177well as top and bottom. Here is a Hasse diagram for it:
178
179![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)
180
181### Formalization
182
183Let's consider a slightly more complex example, and think about how we can
184compute the sets of possible values algorithmically.
185
186```c++
187void Example(int n) {
188  int x;          // x is ⊥
189  if (n > 0) {
190    if (n == 42) {
191       x = 44;    // x is {44}
192    } else {
193       x = 5;     // x is {5}
194    }
195    print(x);     // x is {44; 5}
196  } else {
197    x = n;        // x is ⊤
198  }
199  print(x);       // x is ⊤
200}
201```
202
203As humans, we understand the control flow from the program text. We used our
204understanding of control flow to find program points where two flows join.
205Formally, control flow is represented by a CFG (control flow graph):
206
207![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)
208
209We can compute sets of possible values by propagating them through the CFG of
210the function:
211
212*   When `x` is declared but not initialized, its possible values are `{}`. The
213    empty set plays the role of `⊥` in this lattice.
214
215*   When `x` is assigned a concrete value, its possible set of values contains
216    just that specific value.
217
218*   When `x` is assigned some unknown value, it can have any value. We represent
219    this fact as `⊤`.
220
221*   When two control flow paths join, we compute the set union of incoming
222    values (limiting the number of elements to 3, representing larger sets as
223    `⊤`).
224
225The sets of possible values are influenced by:
226
227*   Statements, for example, assignments.
228
229*   Joins in control flow, for example, ones that appear at the end of "if"
230    statements.
231
232**Effects of statements** are modeled by what is formally known as a transfer
233function. A transfer function takes two arguments: the statement, and the state
234of `x` at the previous program point. It produces the state of `x` at the next
235program point. For example, the transfer function for assignment ignores the
236state at the previous program point:
237
238```c++
239// GIVEN: x is {42; 44}
240x = 0;
241// CONCLUSION: x is {0}
242```
243
244The transfer function for `+` performs arithmetic on every set member:
245
246```c++
247// GIVEN: x is {42, 44}
248x = x + 100;
249// CONCLUSION: x is {142, 144}
250```
251
252**Effects of control flow** are modeled by joining the knowledge from all
253possible previous program points.
254
255```c++
256if (...) {
257  ...
258  // GIVEN: x is {42}
259} else {
260  ...
261  // GIVEN: x is {44}
262}
263// CONCLUSION: x is {42; 44}
264```
265
266```c++
267// GIVEN: x is {42}
268while (...) {
269  ...
270  // GIVEN: x is {44}
271}
272// CONCLUSION: {42; 44}
273```
274
275The predicate that we marked "given" is usually called a precondition, and the
276conclusion is called a postcondition.
277
278In terms of the CFG, we join the information from all predecessor basic blocks.
279
280![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)
281
282Putting it all together, to model the effects of a basic block we compute:
283
284```
285out = transfer(basic_block, join(in_1, in_2, ..., in_n))
286```
287
288(Note that there are other ways to write this equation that produce higher
289precision analysis results. The trick is to keep exploring the execution paths
290separately and delay joining until later. However, we won't discuss those
291variations here.)
292
293To make a conclusion about all paths through the program, we repeat this
294computation on all basic blocks until we reach a fixpoint. In other words, we
295keep propagating information through the CFG until the computed sets of values
296stop changing.
297
298If the lattice has a finite height and transfer functions are monotonic the
299algorithm is guaranteed to terminate.  Each iteration of the algorithm can
300change computed values only to larger values from the lattice. In the worst
301case, all computed values become `⊤`, which is not very useful, but at least the
302analysis terminates at that point, because it can't change any of the values.
303
304Fixpoint iteration can be optimised by only reprocessing basic blocks which had
305one of their inputs changed on the previous iteration. This is typically
306implemented using a worklist queue. With this optimisation the time complexity
307becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
308`|L|` is the size of lattice used by the analysis.
309
310## Symbolic execution: a very short informal introduction
311
312### Symbolic values
313
314In the previous example where we tried to figure out what values a variable can
315have, the analysis had to be seeded with a concrete value. What if there are no
316assignments of concrete values in the program? We can still deduce some
317interesting information by representing unknown input values symbolically, and
318computing results as symbolic expressions:
319
320```c++
321void PrintAbs(int x) {
322  int result;
323  if (x >= 0) {
324    result = x;   // result is {x}
325  } else {
326    result = -x;  // result is {-x}
327  }
328  print(result);  // result is {x; -x}
329}
330```
331
332We can't say what specific value gets printed, but we know that it is either `x`
333or `-x`.
334
335Dataflow analysis is an instance of abstract interpretation, and does not dictate
336how exactly the lattice and transfer functions should be designed, beyond the
337necessary conditions for the analysis to converge. Nevertheless, we can use
338symbolic execution ideas to guide our design of the lattice and transfer
339functions: lattice values can be symbolic expressions, and transfer functions
340can construct more complex symbolic expressions from symbolic expressions that
341represent arguments. See [this StackOverflow
342discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
343for a further comparison of abstract interpretation and symbolic execution.
344
345### Flow condition
346
347A human can say about the previous example that the function returns `x` when
348`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
349tracking a flow condition. A flow condition is a predicate written in terms of
350the program state that is true at a specific program point regardless of the
351execution path that led to this statement. For example, the flow condition for
352the program point right before evaluating `result = x` is `x >= 0`.
353
354If we enhance the lattice to be a set of pairs of values and predicates, the
355dataflow analysis computes the following values:
356
357```c++
358void PrintAbs(int x) {
359  int result;
360  if (x >= 0) {
361    // Flow condition: x >= 0.
362    result = x;   // result is {x if x >= 0}
363  } else {
364    // Flow condition: x < 0.
365    result = -x;  // result is {-x if x < 0}
366  }
367  print(result);  // result is {x if x >= 0; -x if x < 0}
368}
369```
370
371Of course, in a program with loops, symbolic expressions for flow conditions can
372grow unbounded. A practical static analysis system must control this growth to
373keep the symbolic representations manageable and ensure that the data flow
374analysis terminates. For example, it can use a constraint solver to prune
375impossible flow conditions, and/or it can abstract them, losing precision, after
376their symbolic representations grow beyond some threshold. This is similar to
377how we had to limit the sizes of computed sets of possible values to 3 elements.
378
379### Symbolic pointers
380
381This approach proves to be particularly useful for modeling pointer values,
382since we don't care about specific addresses but just want to give a unique
383identifier to a memory location.
384
385```c++
386void ExampleOfSymbolicPointers(bool b) {
387  int x = 0;     // x is {0}
388  int* ptr = &x; // x is {0}      ptr is {&x}
389  if (b) {
390    *ptr = 42;   // x is {42}     ptr is {&x}
391  }
392  print(x);      // x is {0; 42}  ptr is {&x}
393}
394```
395
396## Example: finding output parameters
397
398Let's explore how data flow analysis can help with a problem that is hard to
399solve with other tools in Clang.
400
401### Problem description
402
403Output parameters are function parameters of pointer or reference type whose
404pointee is completely overwritten by the function, and not read before it is
405overwritten. They are common in pre-C++11 code due to the absence of move
406semantics. In modern C++ output parameters are non-idiomatic, and return values
407are used instead.
408
409Imagine that we would like to refactor output parameters to return values to
410modernize old code. The first step is to identify refactoring candidates through
411static analysis.
412
413For example, in the following code snippet the pointer `c` is an output
414parameter:
415
416```c++
417struct Customer {
418  int account_id;
419  std::string name;
420}
421
422void GetCustomer(Customer *c) {
423  c->account_id = ...;
424  if (...) {
425    c->name = ...;
426  } else {
427    c->name = ...;
428  }
429}
430```
431
432We would like to refactor this code into:
433
434```c++
435Customer GetCustomer() {
436  Customer c;
437  c.account_id = ...;
438  if (...) {
439    c.name = ...;
440  } else {
441    c.name = ...;
442  }
443  return c;
444}
445```
446
447However, in the function below the parameter `c` is not an output parameter
448because its field `name` is not overwritten on every path through the function.
449
450```c++
451void GetCustomer(Customer *c) {
452  c->account_id = ...;
453  if (...) {
454    c->name = ...;
455  }
456}
457```
458
459The code also cannot read the value of the parameter before overwriting it:
460
461```c++
462void GetCustomer(Customer *c) {
463  use(c->account_id);
464  c->name = ...;
465  c->account_id = ...;
466}
467```
468
469Functions that escape the pointer also block the refactoring:
470
471```c++
472Customer* kGlobalCustomer;
473
474void GetCustomer(Customer *c) {
475  c->name = ...;
476  c->account_id = ...;
477  kGlobalCustomer = c;
478}
479```
480
481To identify a candidate function for refactoring, we need to do the following:
482
483*   Find a function with a non-const pointer or reference parameter.
484
485*   Find the definition of that function.
486
487*   Prove that the function completely overwrites the pointee on all paths
488    before returning.
489
490*   Prove that the function reads the pointee only after overwriting it.
491
492*   Prove that the function does not persist the pointer in a data structure
493    that is live after the function returns.
494
495There are also requirements that all usage sites of the candidate function must
496satisfy, for example, that function arguments do not alias, that users are not
497taking the address of the function, and so on. Let's consider verifying usage
498site conditions to be a separate static analysis problem.
499
500### Lattice design
501
502To analyze the function body we can use a lattice which consists of normal
503states and failure states. A normal state describes program points where we are
504sure that no behaviors that block the refactoring have occurred. Normal states
505keep track of all parameter's member fields that are known to be overwritten on
506every path from function entry to the corresponding program point. Failure
507states accumulate observed violations (unsafe reads and pointer escapes) that
508block the refactoring.
509
510In the partial order of the lattice failure states compare greater than normal
511states, which guarantees that they "win" when joined with normal states. Order
512between failure states is determined by inclusion relation on the set of
513accumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order
514between normal states is determined by reversed inclusion relation on the set of
515overwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of
516overwritten fields).
517
518![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)
519
520To determine whether a statement reads or writes a field we can implement
521symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer
522dereference operator and `MemberExpr`s.
523
524### Using data flow results to identify output parameters
525
526Let's take a look at how we use data flow analysis to identify an output
527parameter. The refactoring can be safely done when the data flow algorithm
528computes a normal state with all of the fields proven to be overwritten in the
529exit basic block of the function.
530
531```c++
532struct Customer {
533  int account_id;
534  std::string name;
535};
536
537void GetCustomer(Customer* c) {
538  // Overwritten: {}
539  c->account_id = ...; // Overwritten: {c->account_id}
540  if (...) {
541    c->name = ...;     // Overwritten: {c->account_id, c->name}
542  } else {
543    c->name = ...;     // Overwritten: {c->account_id, c->name}
544  }
545  // Overwritten: {c->account_id, c->name}
546}
547```
548
549When the data flow algorithm computes a normal state, but not all fields are
550proven to be overwritten we can't perform the refactoring.
551
552```c++
553void target(bool b, Customer* c) {
554  // Overwritten: {}
555  if (b) {
556    c->account_id = 42;     // Overwritten: {c->account_id}
557  } else {
558    c->name = "Konrad";  // Overwritten: {c->name}
559  }
560  // Overwritten: {}
561}
562```
563
564Similarly, when the data flow algorithm computes a failure state, we also can't
565perform the refactoring.
566
567```c++
568Customer* kGlobalCustomer;
569
570void GetCustomer(Customer* c) {
571  // Overwritten: {}
572  c->account_id = ...;    // Overwritten: {c->account_id}
573  if (...) {
574    print(c->name);       // Unsafe read
575  } else {
576    kGlobalCustomer = c;  // Pointer escape
577  }
578  // Unsafe read, Pointer escape
579}
580```
581
582## Example: finding dead stores
583
584Let's say we want to find redundant stores, because they indicate potential
585bugs.
586
587```c++
588x = GetX();
589x = GetY();
590```
591
592The first store to `x` is never read, probably there is a bug.
593
594The implementation of dead store analysis is very similar to output parameter
595analysis: we need to track stores and loads, and find stores that were never
596read.
597
598[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
599generalization of this idea, which is often used to answer many related
600questions, for example:
601
602* finding dead stores,
603* finding uninitialized variables,
604* finding a good point to deallocate memory,
605* finding out if it would be safe to move an object.
606
607## Example: definitive initialization
608
609Definitive initialization proves that variables are known to be initialized when
610read. If we find a variable which is read when not initialized then we generate
611a warning.
612
613```c++
614void Init() {
615  int x;    // x is uninitialized
616  if (cond()) {
617    x = 10; // x is initialized
618  } else {
619    x = 20; // x is initialized
620  }
621  print(x); // x is initialized
622}
623```
624
625```c++
626void Uninit() {
627  int x;    // x is uninitialized
628  if (cond()) {
629    x = 10; // x is initialized
630  }
631  print(x); // x is maybe uninitialized, x is being read, report a bug.
632}
633```
634
635For this purpose we can use lattice in a form of a mapping from variable
636declarations to initialization states; each initialization state is represented
637by the following lattice:
638
639![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)
640
641A lattice element could also capture the source locations of the branches that
642lead us to the corresponding program point. Diagnostics would use this
643information to show a sample buggy code path to the user.
644
645## Example: refactoring raw pointers to `unique_ptr`
646
647Modern idiomatic C++ uses smart pointers to express memory ownership, however in
648pre-C++11 code one can often find raw pointers that own heap memory blocks.
649
650Imagine that we would like to refactor raw pointers that own memory to
651`unique_ptr`. There are multiple ways to design a data flow analysis for this
652problem; let's look at one way to do it.
653
654For example, we would like to refactor the following code that uses raw
655pointers:
656
657```c++
658void UniqueOwnership1() {
659  int *pi = new int;
660  if (...) {
661    Borrow(pi);
662    delete pi;
663  } else {
664    TakeOwnership(pi);
665  }
666}
667```
668
669into code that uses `unique_ptr`:
670
671```c++
672void UniqueOwnership1() {
673  auto pi = std::make_unique<int>();
674  if (...) {
675    Borrow(pi.get());
676  } else {
677    TakeOwnership(pi.release());
678  }
679}
680```
681
682This problem can be solved with a lattice in form of map from value declarations
683to pointer states:
684
685![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)
686
687We can perform the refactoring if at the exit of a function `pi` is
688`Compatible`.
689
690```c++
691void UniqueOwnership1() {
692  int *pi;             // pi is Compatible
693  pi = new int;        // pi is Defined
694  if (...) {
695    Borrow(pi);        // pi is Defined
696    delete pi;         // pi is Compatible
697  } else {
698    TakeOwnership(pi); // pi is Compatible
699  }
700  // pi is Compatible
701}
702```
703
704Let's look at an example where the raw pointer owns two different memory blocks:
705
706```c++
707void UniqueOwnership2() {
708  int *pi = new int;  // pi is Defined
709  Borrow(pi);
710  delete pi;          // pi is Compatible
711  if (smth) {
712    pi = new int;     // pi is Defined
713    Borrow(pi);
714    delete pi;        // pi is Compatible
715  }
716  // pi is Compatible
717}
718```
719
720It can be refactored to use `unique_ptr` like this:
721
722```c++
723void UniqueOwnership2() {
724  auto pi = make_unique<int>();
725  Borrow(pi);
726  if (smth) {
727    pi = make_unique<int>();
728    Borrow(pi);
729  }
730}
731```
732
733In the following example, the raw pointer is used to access the heap object
734after the ownership has been transferred.
735
736```c++
737void UniqueOwnership3() {
738  int *pi = new int; // pi is Defined
739  if (...) {
740    Borrow(pi);
741    delete pi;       // pi is Compatible
742  } else {
743    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
744    print(*pi);
745    use(v);
746  }
747  // pi is Compatible
748}
749```
750
751We can refactor this code to use `unique_ptr`, however we would have to
752introduce a non-owning pointer variable, since we can't use the moved-from
753`unique_ptr` to access the object:
754
755```c++
756void UniqueOwnership3() {
757  std::unique_ptr<int> pi = std::make_unique<int>();
758  if (...) {
759    Borrow(pi);
760  } else {
761    int *pi_non_owning = pi.get();
762    vector<unique_ptr<int>> v = {std::move(pi)};
763    print(*pi_non_owning);
764    use(v);
765  }
766}
767```
768
769If the original code didn't call `delete` at the very end of the function, then
770our refactoring may change the point at which we run the destructor and release
771memory. Specifically, if there is some user code after `delete`, then extending
772the lifetime of the object until the end of the function may hold locks for
773longer than necessary, introduce memory overhead etc.
774
775One solution is to always replace `delete` with a call to `reset()`, and then
776perform another analysis that removes unnecessary `reset()` calls.
777
778```c++
779void AddedMemoryOverhead() {
780  HugeObject *ho = new HugeObject();
781  use(ho);
782  delete ho; // Release the large amount of memory quickly.
783  LongRunningFunction();
784}
785```
786
787This analysis will refuse to refactor code that mixes borrowed pointer values
788and unique ownership. In the following code, `GetPtr()` returns a borrowed
789pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
790pointer. We don't distinguish between these two assignments, and we want each
791assignment to be paired with a corresponding sink; otherwise, we transition the
792pointer to a `Conflicting` state, like in this example.
793
794```c++
795void ConflictingOwnership() {
796  int *pi;           // pi is Compatible
797  pi = GetPtr();     // pi is Defined
798  Borrow(pi);        // pi is Defined
799
800  pi = new int;      // pi is Conflicting
801  Borrow(pi);
802  delete pi;
803  // pi is Conflicting
804}
805```
806
807We could still handle this case by finding a maximal range in the code where
808`pi` could be in the Compatible state, and only refactoring that part.
809
810```c++
811void ConflictingOwnership() {
812  int *pi;
813  pi = GetPtr();
814  Borrow(pi);
815
816  std::unique_ptr<int> pi_unique = std::make_unique<int>();
817  Borrow(pi_unique.get());
818}
819```
820
821## Example: finding redundant branch conditions
822
823In the code below `b1` should not be checked in both the outer and inner "if"
824statements. It is likely there is a bug in this code.
825
826```c++
827int F(bool b1, bool b2) {
828  if (b1) {
829    f();
830    if (b1 && b2) {  // Check `b1` again -- unnecessary!
831      g();
832    }
833  }
834}
835```
836
837A checker that finds this pattern syntactically is already implemented in
838ClangTidy using AST matchers (`bugprone-redundant-branch-condition`).
839
840To implement it using the data flow analysis framework, we can produce a warning
841if any part of the branch condition is implied by the flow condition.
842
843```c++
844int F(bool b1, bool b2) {
845  // Flow condition: true.
846  if (b1) {
847    // Flow condition: b1.
848    f();
849    if (b1 && b2) { // `b1` is implied by the flow condition.
850      g();
851    }
852  }
853}
854```
855
856One way to check this implication is to use a SAT solver. Without a SAT solver,
857we could keep the flow condition in the CNF form and then it would be easy to
858check the implication.
859
860## Example: finding unchecked `std::optional` unwraps
861
862Calling `optional::value()` is only valid if `optional::has_value()` is true. We
863want to show that when `x.value()` is executed, the flow condition implies
864`x.has_value()`.
865
866In the example below `x.value()` is accessed safely because it is guarded by the
867`x.has_value()` check.
868
869```c++
870void Example(std::optional<int> &x) {
871  if (x.has_value()) {
872    use(x.value());
873  }
874}
875```
876
877While entering the if branch we deduce that `x.has_value()` is implied by the
878flow condition.
879
880```c++
881void Example(std::optional<int> x) {
882  // Flow condition: true.
883  if (x.has_value()) {
884    // Flow condition: x.has_value() == true.
885    use(x.value());
886  }
887  // Flow condition: true.
888}
889```
890
891We also need to prove that `x` is not modified between check and value access.
892The modification of `x` may be very subtle:
893
894```c++
895void F(std::optional<int> &x);
896
897void Example(std::optional<int> &x) {
898  if (x.has_value()) {
899    // Flow condition: x.has_value() == true.
900    unknown_function(x); // may change x.
901    // Flow condition: true.
902    use(x.value());
903  }
904}
905```
906
907## Example: finding dead code behind A/B experiment flags
908
909Finding dead code is a classic application of data flow analysis.
910
911Unused flags for A/B experiment hide dead code. However, this flavor of dead
912code is invisible to the compiler because the flag can be turned on at any
913moment.
914
915We could make a tool that deletes experiment flags. The user tells us which flag
916they want to delete, and we assume that the it's value is a given constant.
917
918For example, the user could use the tool to remove `example_flag` from this
919code:
920
921```c++
922DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");
923
924void Example() {
925  bool x = GetFlag(FLAGS_example_flag).empty();
926  f();
927  if (x) {
928    g();
929  } else {
930    h();
931  }
932}
933```
934
935The tool would simplify the code to:
936
937```c++
938void Example() {
939  f();
940  g();
941}
942```
943
944We can solve this problem with a classic constant propagation lattice combined
945with symbolic evaluation.
946
947## Example: finding inefficient usages of associative containers
948
949Real-world code often accidentally performs repeated lookups in associative
950containers:
951
952```c++
953map<int, Employee> xs;
954xs[42]->name = "...";
955xs[42]->title = "...";
956```
957
958To find the above inefficiency we can use the available expressions analysis to
959understand that `m[42]` is evaluated twice.
960
961```c++
962map<int, Employee> xs;
963Employee &e = xs[42];
964e->name = "...";
965e->title = "...";
966```
967
968We can also track the `m.contains()` check in the flow condition to find
969redundant checks, like in the example below.
970
971```c++
972std::map<int, Employee> xs;
973if (!xs.contains(42)) {
974  xs.insert({42, someEmployee});
975}
976```
977
978## Example: refactoring types that implicitly convert to each other
979
980Refactoring one strong type to another is difficult, but the compiler can help:
981once you refactor one reference to the type, the compiler will flag other places
982where this information flows with type mismatch errors. Unfortunately this
983strategy does not work when you are refactoring types that implicitly convert to
984each other, for example, replacing `int32_t` with `int64_t`.
985
986Imagine that we want to change user IDs from 32 to 64-bit integers. In other
987words, we need to find all integers tainted with user IDs. We can use data flow
988analysis to implement taint analysis.
989
990```c++
991void UseUser(int32_t user_id) {
992  int32_t id = user_id;
993  // Variable `id` is tainted with a user ID.
994  ...
995}
996```
997
998Taint analysis is very well suited to this problem because the program rarely
999branches on user IDs, and almost certainly does not perform any computation
1000(like arithmetic).
1001