*12c85518Srobert# Data flow analysis: an informal introduction
*12c85518Srobert
*12c85518Srobert## Abstract
*12c85518Srobert
*12c85518SrobertThis document introduces data flow analysis in an informal way. The goal is to
*12c85518Srobertgive the reader an intuitive understanding of how it works, and show how it
*12c85518Srobertapplies to a range of refactoring and bug finding problems.
*12c85518Srobert
*12c85518SrobertData flow analysis is a well-established technique; it is described in many
*12c85518Srobertpapers, books, and videos. If you would like a more formal, or a more thorough
*12c85518Srobertexplanation of the concepts mentioned in this document, please refer to the
*12c85518Srobertfollowing resources:
*12c85518Srobert
*12c85518Srobert*   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
*12c85518Srobert*   Videos on the PacketPrep YouTube channel that introduce lattices and the
*12c85518Srobert    necessary background information:
*12c85518Srobert    [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
*12c85518Srobert    [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
*12c85518Srobert    [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
*12c85518Srobert    [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
*12c85518Srobert    [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
*12c85518Srobert    [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
*12c85518Srobert*   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
*12c85518Srobert*   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
*12c85518Srobert*   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
*12c85518Srobert*   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
*12c85518Srobert*   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
*12c85518Srobert    (a paper that successfully applies symbolic execution to real-world
*12c85518Srobert    software).
*12c85518Srobert
*12c85518Srobert## Data flow analysis
*12c85518Srobert
*12c85518Srobert### The purpose of data flow analysis
*12c85518Srobert
*12c85518SrobertData flow analysis is a static analysis technique that proves facts about a
*12c85518Srobertprogram or its fragment. It can make conclusions about all paths through the
*12c85518Srobertprogram, while taking control flow into account and scaling to large programs.
*12c85518SrobertThe basic idea is propagating facts about the program through the edges of the
*12c85518Srobertcontrol flow graph (CFG) until a fixpoint is reached.
*12c85518Srobert
*12c85518Srobert### Sample problem and an ad-hoc solution
*12c85518Srobert
*12c85518SrobertWe would like to explain data flow analysis while discussing an example. Let's
*12c85518Srobertimagine that we want to track possible values of an integer variable in our
*12c85518Srobertprogram. Here is how a human could annotate the code:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid Example(int n) {
*12c85518Srobert  int x = 0;
*12c85518Srobert  // x is {0}
*12c85518Srobert  if (n > 0) {
*12c85518Srobert    x = 5;
*12c85518Srobert    // x is {5}
*12c85518Srobert  } else {
*12c85518Srobert    x = 42;
*12c85518Srobert    // x is {42}
*12c85518Srobert  }
*12c85518Srobert  // x is {5; 42}
*12c85518Srobert  print(x);
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertWe use sets of integers to represent possible values of `x`. Local variables
*12c85518Sroberthave unambiguous values between statements, so we annotate program points
*12c85518Srobertbetween statements with sets of possible values.
*12c85518Srobert
*12c85518SrobertHere is how we arrived at these annotations. Assigning a constant to `x` allows
*12c85518Srobertus to make a conclusion that `x` can only have one value. When control flow from
*12c85518Srobertthe "then" and "else" branches joins, `x` can have either value.
*12c85518Srobert
*12c85518SrobertAbstract algebra provides a nice formalism that models this kind of structure,
*12c85518Srobertnamely, a lattice. A join-semilattice is a partially ordered set, in which every
*12c85518Sroberttwo elements have a least upper bound (called a *join*).
*12c85518Srobert
*12c85518Srobert```
*12c85518Srobertjoin(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertFor this problem we will use the lattice of subsets of integers, with set
*12c85518Srobertinclusion relation as ordering and set union as a join.
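
As a minimal sketch, this lattice could be written in C++ as follows. The names
`IntSet`, `Join`, and `LessOrEqual` are illustrative assumptions and are not
part of Clang.

```c++
#include <algorithm>
#include <cassert>
#include <set>

// Hypothetical helper type: a lattice element is a set of possible values.
using IntSet = std::set<int>;

// The join of two elements is their set union.
IntSet Join(const IntSet &a, const IntSet &b) {
  IntSet result = a;
  result.insert(b.begin(), b.end());
  return result;
}

// The ordering is set inclusion: a ⩽ b holds when a is a subset of b.
bool LessOrEqual(const IntSet &a, const IntSet &b) {
  return std::includes(b.begin(), b.end(), a.begin(), a.end());
}
```

With these definitions, joining `{5}` and `{42}` yields `{5; 42}`, which matches
the annotation after the "if" statement in the example above.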
*12c85518Srobert
*12c85518SrobertLattices are often represented visually as Hasse diagrams. Here is a Hasse
*12c85518Srobertdiagram for our lattice that tracks subsets of integers:
*12c85518Srobert
*12c85518Srobert![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)
*12c85518Srobert
Computing the join in the lattice corresponds to finding the lowest common
ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
literature on efficiently implementing LCA queries for a DAG; however,
"Efficient Implementation of Lattice Operations" (1989)
([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that is
particularly well-suited for programmatic implementation.
*12c85518Srobert
*12c85518Srobert### Too much information and "top" values
*12c85518Srobert
*12c85518SrobertLet's try to find the possible sets of values of `x` in a function that modifies
*12c85518Srobert`x` in a loop:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid ExampleOfInfiniteSets() {
*12c85518Srobert  int x = 0; // x is {0}
*12c85518Srobert  while (condition()) {
*12c85518Srobert    x += 1;  // x is {0; 1; 2; …}
*12c85518Srobert  }
*12c85518Srobert  print(x);  // x is {0; 1; 2; …}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
We have an issue: `x` can have any value greater than or equal to zero. If the
program operated on mathematical integers, that would be an infinite set of
values. In C++, `int` is limited by `INT_MAX`, so technically we have a set
`{0; 1; …; INT_MAX}`, which is still really big.
*12c85518Srobert
*12c85518SrobertTo make our analysis practical to compute, we have to limit the amount of
*12c85518Srobertinformation that we track. In this case, we can, for example, arbitrarily limit
*12c85518Srobertthe size of sets to 3 elements. If at a certain program point `x` has more than
*12c85518Srobert3 possible values, we stop tracking specific values at that program point.
*12c85518SrobertInstead, we denote possible values of `x` with the symbol `⊤` (pronounced "top"
*12c85518Srobertaccording to a convention in abstract algebra).
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid ExampleOfTopWithALoop() {
*12c85518Srobert  int x = 0;  // x is {0}
*12c85518Srobert  while (condition()) {
*12c85518Srobert    x += 1;   // x is ⊤
*12c85518Srobert  }
*12c85518Srobert  print(x);   // x is ⊤
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertThe statement "at this program point, `x`'s possible values are `⊤`" is
*12c85518Srobertunderstood as "at this program point `x` can have any value because we have too
*12c85518Srobertmuch information, or the information is conflicting".
*12c85518Srobert
*12c85518SrobertNote that we can get more than 3 possible values even without a loop:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid ExampleOfTopWithoutLoops(int n) {
*12c85518Srobert  int x = 0;  // x is {0}
*12c85518Srobert  switch(n) {
*12c85518Srobert    case 0:  x = 1; break; // x is {1}
*12c85518Srobert    case 1:  x = 9; break; // x is {9}
*12c85518Srobert    case 2:  x = 7; break; // x is {7}
*12c85518Srobert    default: x = 3; break; // x is {3}
*12c85518Srobert  }
*12c85518Srobert  // x is ⊤
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert### Uninitialized variables and "bottom" values
*12c85518Srobert
*12c85518SrobertWhen `x` is declared but not initialized, it has no possible values. We
*12c85518Srobertrepresent this fact symbolically as `⊥` (pronounced "bottom").
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid ExampleOfBottom() {
*12c85518Srobert  int x;    // x is ⊥
*12c85518Srobert  x = 42;   // x is {42}
*12c85518Srobert  print(x);
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
Note that using values read from uninitialized variables is undefined behavior
in C++. Generally, compilers and static analysis tools can assume undefined
behavior does not happen. We must model uninitialized variables only when we are
implementing a checker that is specifically trying to find uninitialized reads.
In this example we show how to model uninitialized variables only to demonstrate
the concept of "bottom", and how it applies to possible value analysis. We
describe an analysis that finds uninitialized reads in a section below.
*12c85518Srobert
*12c85518Srobert### A practical lattice that tracks sets of concrete values
*12c85518Srobert
*12c85518SrobertTaking into account all corner cases covered above, we can put together a
*12c85518Srobertlattice that we can use in practice to track possible values of integer
*12c85518Srobertvariables. This lattice represents sets of integers with 1, 2, or 3 elements, as
*12c85518Srobertwell as top and bottom. Here is a Hasse diagram for it:
*12c85518Srobert
*12c85518Srobert![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)
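
A minimal sketch of this finite lattice in C++ might look as follows. The type
`ValueSet`, the constant `kMaxSetSize`, and the function names are assumptions
made for illustration; they do not come from Clang.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>

// Sets with more elements than this are represented as ⊤.
const std::size_t kMaxSetSize = 3;

// Hypothetical lattice element: either ⊤ or a small set of concrete values.
// The empty set plays the role of ⊥.
struct ValueSet {
  bool is_top = false;
  std::set<int> values;  // Always empty when is_top is true.
};

// Builds a lattice element, collapsing sets that are too large into ⊤.
ValueSet MakeValueSet(const std::set<int> &values) {
  ValueSet result;
  if (values.size() > kMaxSetSize)
    result.is_top = true;
  else
    result.values = values;
  return result;
}

// Join is set union, except that ⊤ absorbs everything.
ValueSet Join(const ValueSet &a, const ValueSet &b) {
  if (a.is_top) return a;
  if (b.is_top) return b;
  std::set<int> merged = a.values;
  merged.insert(b.values.begin(), b.values.end());
  return MakeValueSet(merged);
}

// Ordering: everything is below ⊤; otherwise compare by set inclusion.
bool LessOrEqual(const ValueSet &a, const ValueSet &b) {
  if (b.is_top) return true;
  if (a.is_top) return false;
  return std::includes(b.values.begin(), b.values.end(),
                       a.values.begin(), a.values.end());
}
```

Joining `{1}`, `{9}`, `{7}`, and `{3}` with this sketch produces `⊤`, as in the
`ExampleOfTopWithoutLoops` function above.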
*12c85518Srobert
*12c85518Srobert### Formalization
*12c85518Srobert
*12c85518SrobertLet's consider a slightly more complex example, and think about how we can
*12c85518Srobertcompute the sets of possible values algorithmically.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid Example(int n) {
*12c85518Srobert  int x;          // x is ⊥
*12c85518Srobert  if (n > 0) {
*12c85518Srobert    if (n == 42) {
*12c85518Srobert       x = 44;    // x is {44}
*12c85518Srobert    } else {
*12c85518Srobert       x = 5;     // x is {5}
*12c85518Srobert    }
*12c85518Srobert    print(x);     // x is {44; 5}
*12c85518Srobert  } else {
*12c85518Srobert    x = n;        // x is ⊤
*12c85518Srobert  }
*12c85518Srobert  print(x);       // x is ⊤
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertAs humans, we understand the control flow from the program text. We used our
*12c85518Srobertunderstanding of control flow to find program points where two flows join.
*12c85518SrobertFormally, control flow is represented by a CFG (control flow graph):
*12c85518Srobert
*12c85518Srobert![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)
*12c85518Srobert
*12c85518SrobertWe can compute sets of possible values by propagating them through the CFG of
*12c85518Srobertthe function:
*12c85518Srobert
*12c85518Srobert*   When `x` is declared but not initialized, its possible values are `{}`. The
*12c85518Srobert    empty set plays the role of `⊥` in this lattice.
*12c85518Srobert
*12c85518Srobert*   When `x` is assigned a concrete value, its possible set of values contains
*12c85518Srobert    just that specific value.
*12c85518Srobert
*12c85518Srobert*   When `x` is assigned some unknown value, it can have any value. We represent
*12c85518Srobert    this fact as `⊤`.
*12c85518Srobert
*   When two control flow paths join, we compute the set union of incoming
    values (limiting the number of elements to 3, representing larger sets as
    `⊤`).
*12c85518Srobert
*12c85518SrobertThe sets of possible values are influenced by:
*12c85518Srobert
*12c85518Srobert*   Statements, for example, assignments.
*12c85518Srobert
*12c85518Srobert*   Joins in control flow, for example, ones that appear at the end of "if"
*12c85518Srobert    statements.
*12c85518Srobert
*12c85518Srobert**Effects of statements** are modeled by what is formally known as a transfer
*12c85518Srobertfunction. A transfer function takes two arguments: the statement, and the state
*12c85518Srobertof `x` at the previous program point. It produces the state of `x` at the next
*12c85518Srobertprogram point. For example, the transfer function for assignment ignores the
*12c85518Srobertstate at the previous program point:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobert// GIVEN: x is {42; 44}
*12c85518Srobertx = 0;
*12c85518Srobert// CONCLUSION: x is {0}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertThe transfer function for `+` performs arithmetic on every set member:
*12c85518Srobert
*12c85518Srobert```c++
// GIVEN: x is {42; 44}
x = x + 100;
// CONCLUSION: x is {142; 144}
*12c85518Srobert```
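
The two transfer functions above can be sketched in C++ as follows. This is a
simplified illustration that ignores `⊤`, the size limit, and integer overflow;
`IntSet`, `TransferAssignConstant`, and `TransferAddConstant` are hypothetical
names.

```c++
#include <cassert>
#include <set>

// Hypothetical helper type: the set of possible values of `x`.
using IntSet = std::set<int>;

// Transfer function for `x = constant`: the previous state is ignored.
IntSet TransferAssignConstant(const IntSet & /*before*/, int constant) {
  IntSet after;
  after.insert(constant);
  return after;
}

// Transfer function for `x = x + addend`: add to every set member.
IntSet TransferAddConstant(const IntSet &before, int addend) {
  IntSet after;
  for (int value : before) after.insert(value + addend);
  return after;
}
```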
*12c85518Srobert
*12c85518Srobert**Effects of control flow** are modeled by joining the knowledge from all
*12c85518Srobertpossible previous program points.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertif (...) {
*12c85518Srobert  ...
*12c85518Srobert  // GIVEN: x is {42}
*12c85518Srobert} else {
*12c85518Srobert  ...
*12c85518Srobert  // GIVEN: x is {44}
*12c85518Srobert}
*12c85518Srobert// CONCLUSION: x is {42; 44}
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobert// GIVEN: x is {42}
*12c85518Srobertwhile (...) {
*12c85518Srobert  ...
*12c85518Srobert  // GIVEN: x is {44}
*12c85518Srobert}
// CONCLUSION: x is {42; 44}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertThe predicate that we marked "given" is usually called a precondition, and the
*12c85518Srobertconclusion is called a postcondition.
*12c85518Srobert
*12c85518SrobertIn terms of the CFG, we join the information from all predecessor basic blocks.
*12c85518Srobert
*12c85518Srobert![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)
*12c85518Srobert
*12c85518SrobertPutting it all together, to model the effects of a basic block we compute:
*12c85518Srobert
*12c85518Srobert```
*12c85518Srobertout = transfer(basic_block, join(in_1, in_2, ..., in_n))
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert(Note that there are other ways to write this equation that produce higher
*12c85518Srobertprecision analysis results. The trick is to keep exploring the execution paths
*12c85518Srobertseparately and delay joining until later. However, we won't discuss those
*12c85518Srobertvariations here.)
*12c85518Srobert
*12c85518SrobertTo make a conclusion about all paths through the program, we repeat this
*12c85518Srobertcomputation on all basic blocks until we reach a fixpoint. In other words, we
*12c85518Srobertkeep propagating information through the CFG until the computed sets of values
*12c85518Srobertstop changing.
*12c85518Srobert
If the lattice has a finite height and transfer functions are monotonic, the
algorithm is guaranteed to terminate. Each iteration of the algorithm can
change computed values only to larger values from the lattice. In the worst
case, all computed values become `⊤`, which is not very useful, but at least the
analysis terminates at that point, because it can't change any of the values.
*12c85518Srobert
Fixpoint iteration can be optimised by only reprocessing basic blocks which had
one of their inputs changed on the previous iteration. This is typically
implemented using a worklist queue. With this optimisation the time complexity
becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
`|L|` is the size of the lattice used by the analysis.
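
To make the algorithm concrete, here is a minimal sketch of worklist-based
fixpoint iteration for a single variable `x`. It is not how Clang implements
its analyses: the types `ValueSet`, `Statement`, and `BasicBlock`, and the
functions `Join`, `Transfer`, `Analyze`, and `BuildLoopCfg` are all assumptions
made for this illustration, and only the statements `x = c` and `x += c` are
modeled.

```c++
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

// Hypothetical lattice element: ⊤, or a set of at most kMaxSetSize values.
// The empty set plays the role of ⊥.
struct ValueSet {
  bool is_top = false;
  std::set<int> values;  // Always empty when is_top is true.
};

const std::size_t kMaxSetSize = 3;

bool operator==(const ValueSet &a, const ValueSet &b) {
  return a.is_top == b.is_top && a.values == b.values;
}

// Set union, collapsing to ⊤ when an input is ⊤ or the result is too large.
ValueSet Join(const ValueSet &a, const ValueSet &b) {
  ValueSet result;
  if (!a.is_top && !b.is_top) {
    result.values = a.values;
    result.values.insert(b.values.begin(), b.values.end());
  }
  if (a.is_top || b.is_top || result.values.size() > kMaxSetSize) {
    result.is_top = true;
    result.values.clear();
  }
  return result;
}

// Only two kinds of statements are modeled: `x = operand` and `x += operand`.
struct Statement {
  enum Kind { kAssign, kAdd } kind;
  int operand;
};

struct BasicBlock {
  std::vector<Statement> statements;
  std::vector<std::size_t> predecessors;
  std::vector<std::size_t> successors;
};

// Applies the transfer function of every statement in the block, in order.
ValueSet Transfer(const BasicBlock &block, ValueSet state) {
  for (const Statement &statement : block.statements) {
    if (statement.kind == Statement::kAssign) {
      state.is_top = false;
      state.values.clear();
      state.values.insert(statement.operand);
    } else if (!state.is_top) {
      std::set<int> shifted;
      for (int value : state.values) shifted.insert(value + statement.operand);
      state.values = shifted;
    }
  }
  return state;
}

// Computes out = transfer(block, join(in_1, ..., in_n)) until nothing changes.
std::vector<ValueSet> Analyze(const std::vector<BasicBlock> &cfg) {
  std::vector<ValueSet> out(cfg.size());  // Every block starts at ⊥.
  std::deque<std::size_t> worklist;
  for (std::size_t id = 0; id < cfg.size(); ++id) worklist.push_back(id);
  while (!worklist.empty()) {
    std::size_t id = worklist.front();
    worklist.pop_front();
    ValueSet in;  // ⊥ is the identity element of the join.
    for (std::size_t pred : cfg[id].predecessors) in = Join(in, out[pred]);
    ValueSet new_out = Transfer(cfg[id], in);
    if (new_out == out[id]) continue;  // Unchanged: nothing to propagate.
    out[id] = new_out;
    for (std::size_t succ : cfg[id].successors) worklist.push_back(succ);
  }
  return out;
}

// CFG of `ExampleOfTopWithALoop`: entry, loop header, loop body, exit.
std::vector<BasicBlock> BuildLoopCfg() {
  std::vector<BasicBlock> cfg(4);
  cfg[0].statements = {{Statement::kAssign, 0}};  // int x = 0;
  cfg[0].successors = {1};
  cfg[1].predecessors = {0, 2};                   // while (condition())
  cfg[1].successors = {2, 3};
  cfg[2].statements = {{Statement::kAdd, 1}};     // x += 1;
  cfg[2].predecessors = {1};
  cfg[2].successors = {1};
  cfg[3].predecessors = {1};                      // print(x);
  return cfg;
}
```

Calling `Analyze(BuildLoopCfg())` leaves `{0}` at the end of the entry block and
`⊤` at the loop header, the loop body, and the exit block, which agrees with the
annotations in `ExampleOfTopWithALoop`.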
*12c85518Srobert
*12c85518Srobert## Symbolic execution: a very short informal introduction
*12c85518Srobert
*12c85518Srobert### Symbolic values
*12c85518Srobert
*12c85518SrobertIn the previous example where we tried to figure out what values a variable can
*12c85518Sroberthave, the analysis had to be seeded with a concrete value. What if there are no
*12c85518Srobertassignments of concrete values in the program? We can still deduce some
*12c85518Srobertinteresting information by representing unknown input values symbolically, and
*12c85518Srobertcomputing results as symbolic expressions:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid PrintAbs(int x) {
*12c85518Srobert  int result;
*12c85518Srobert  if (x >= 0) {
*12c85518Srobert    result = x;   // result is {x}
*12c85518Srobert  } else {
*12c85518Srobert    result = -x;  // result is {-x}
*12c85518Srobert  }
*12c85518Srobert  print(result);  // result is {x; -x}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertWe can't say what specific value gets printed, but we know that it is either `x`
*12c85518Srobertor `-x`.
*12c85518Srobert
Data flow analysis is an instance of abstract interpretation, and does not
dictate how exactly the lattice and transfer functions should be designed,
beyond the necessary conditions for the analysis to converge. Nevertheless, we
can use symbolic execution ideas to guide our design of the lattice and transfer
functions: lattice values can be symbolic expressions, and transfer functions
can construct more complex symbolic expressions from symbolic expressions that
represent arguments. See [this Theoretical Computer Science Stack Exchange
discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
for a further comparison of abstract interpretation and symbolic execution.
*12c85518Srobert
*12c85518Srobert### Flow condition
*12c85518Srobert
A human can say about the previous example that the function prints `x` when
`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
tracking a flow condition. A flow condition is a predicate written in terms of
the program state that is true at a specific program point regardless of the
execution path that led to this statement. For example, the flow condition for
the program point right before evaluating `result = x` is `x >= 0`.
*12c85518Srobert
*12c85518SrobertIf we enhance the lattice to be a set of pairs of values and predicates, the
*12c85518Srobertdataflow analysis computes the following values:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid PrintAbs(int x) {
*12c85518Srobert  int result;
*12c85518Srobert  if (x >= 0) {
*12c85518Srobert    // Flow condition: x >= 0.
*12c85518Srobert    result = x;   // result is {x if x >= 0}
*12c85518Srobert  } else {
*12c85518Srobert    // Flow condition: x < 0.
*12c85518Srobert    result = -x;  // result is {-x if x < 0}
*12c85518Srobert  }
*12c85518Srobert  print(result);  // result is {x if x >= 0; -x if x < 0}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertOf course, in a program with loops, symbolic expressions for flow conditions can
*12c85518Srobertgrow unbounded. A practical static analysis system must control this growth to
*12c85518Srobertkeep the symbolic representations manageable and ensure that the data flow
*12c85518Srobertanalysis terminates. For example, it can use a constraint solver to prune
*12c85518Srobertimpossible flow conditions, and/or it can abstract them, losing precision, after
*12c85518Sroberttheir symbolic representations grow beyond some threshold. This is similar to
*12c85518Sroberthow we had to limit the sizes of computed sets of possible values to 3 elements.
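
As a rough sketch of a lattice made of pairs of values and predicates, we can
keep both the symbolic value and its flow condition as plain strings. This is a
deliberate simplification: a practical analysis would use structured
expressions and a constraint solver, and the names `GuardedValue`,
`GuardedValueSet`, `TransferAssign`, and `Join` are hypothetical.

```c++
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical representation: a symbolic value paired with the flow
// condition under which the variable holds that value.
using GuardedValue = std::pair<std::string, std::string>;
using GuardedValueSet = std::set<GuardedValue>;

// Transfer function for an assignment of `expression` that is executed at a
// program point whose flow condition is `flow_condition`.
GuardedValueSet TransferAssign(const std::string &expression,
                               const std::string &flow_condition) {
  GuardedValueSet after;
  after.insert(GuardedValue(expression, flow_condition));
  return after;
}

// Join keeps every (value, condition) pair from both incoming paths.
GuardedValueSet Join(const GuardedValueSet &a, const GuardedValueSet &b) {
  GuardedValueSet result = a;
  result.insert(b.begin(), b.end());
  return result;
}
```

Joining `TransferAssign("x", "x >= 0")` with `TransferAssign("-x", "x < 0")`
gives the two pairs shown at the `print(result)` call in `PrintAbs`.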
*12c85518Srobert
*12c85518Srobert### Symbolic pointers
*12c85518Srobert
*12c85518SrobertThis approach proves to be particularly useful for modeling pointer values,
*12c85518Srobertsince we don't care about specific addresses but just want to give a unique
*12c85518Srobertidentifier to a memory location.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid ExampleOfSymbolicPointers(bool b) {
*12c85518Srobert  int x = 0;     // x is {0}
*12c85518Srobert  int* ptr = &x; // x is {0}      ptr is {&x}
*12c85518Srobert  if (b) {
*12c85518Srobert    *ptr = 42;   // x is {42}     ptr is {&x}
*12c85518Srobert  }
*12c85518Srobert  print(x);      // x is {0; 42}  ptr is {&x}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert## Example: finding output parameters
*12c85518Srobert
*12c85518SrobertLet's explore how data flow analysis can help with a problem that is hard to
*12c85518Srobertsolve with other tools in Clang.
*12c85518Srobert
*12c85518Srobert### Problem description
*12c85518Srobert
*12c85518SrobertOutput parameters are function parameters of pointer or reference type whose
*12c85518Srobertpointee is completely overwritten by the function, and not read before it is
*12c85518Srobertoverwritten. They are common in pre-C++11 code due to the absence of move
*12c85518Srobertsemantics. In modern C++ output parameters are non-idiomatic, and return values
*12c85518Srobertare used instead.
*12c85518Srobert
*12c85518SrobertImagine that we would like to refactor output parameters to return values to
*12c85518Srobertmodernize old code. The first step is to identify refactoring candidates through
*12c85518Srobertstatic analysis.
*12c85518Srobert
*12c85518SrobertFor example, in the following code snippet the pointer `c` is an output
*12c85518Srobertparameter:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertstruct Customer {
*12c85518Srobert  int account_id;
*12c85518Srobert  std::string name;
};
*12c85518Srobert
*12c85518Srobertvoid GetCustomer(Customer *c) {
*12c85518Srobert  c->account_id = ...;
*12c85518Srobert  if (...) {
*12c85518Srobert    c->name = ...;
*12c85518Srobert  } else {
*12c85518Srobert    c->name = ...;
*12c85518Srobert  }
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertWe would like to refactor this code into:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518SrobertCustomer GetCustomer() {
*12c85518Srobert  Customer c;
*12c85518Srobert  c.account_id = ...;
*12c85518Srobert  if (...) {
*12c85518Srobert    c.name = ...;
*12c85518Srobert  } else {
*12c85518Srobert    c.name = ...;
*12c85518Srobert  }
*12c85518Srobert  return c;
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertHowever, in the function below the parameter `c` is not an output parameter
*12c85518Srobertbecause its field `name` is not overwritten on every path through the function.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid GetCustomer(Customer *c) {
*12c85518Srobert  c->account_id = ...;
*12c85518Srobert  if (...) {
*12c85518Srobert    c->name = ...;
*12c85518Srobert  }
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
The function also must not read the value of the parameter before overwriting
it:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid GetCustomer(Customer *c) {
*12c85518Srobert  use(c->account_id);
*12c85518Srobert  c->name = ...;
*12c85518Srobert  c->account_id = ...;
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertFunctions that escape the pointer also block the refactoring:
*12c85518Srobert
*12c85518Srobert```c++
*12c85518SrobertCustomer* kGlobalCustomer;
*12c85518Srobert
*12c85518Srobertvoid GetCustomer(Customer *c) {
*12c85518Srobert  c->name = ...;
*12c85518Srobert  c->account_id = ...;
*12c85518Srobert  kGlobalCustomer = c;
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertTo identify a candidate function for refactoring, we need to do the following:
*12c85518Srobert
*12c85518Srobert*   Find a function with a non-const pointer or reference parameter.
*12c85518Srobert
*12c85518Srobert*   Find the definition of that function.
*12c85518Srobert
*12c85518Srobert*   Prove that the function completely overwrites the pointee on all paths
*12c85518Srobert    before returning.
*12c85518Srobert
*12c85518Srobert*   Prove that the function reads the pointee only after overwriting it.
*12c85518Srobert
*12c85518Srobert*   Prove that the function does not persist the pointer in a data structure
*12c85518Srobert    that is live after the function returns.
*12c85518Srobert
*12c85518SrobertThere are also requirements that all usage sites of the candidate function must
*12c85518Srobertsatisfy, for example, that function arguments do not alias, that users are not
*12c85518Sroberttaking the address of the function, and so on. Let's consider verifying usage
*12c85518Srobertsite conditions to be a separate static analysis problem.
*12c85518Srobert
*12c85518Srobert### Lattice design
*12c85518Srobert
To analyze the function body we can use a lattice which consists of normal
states and failure states. A normal state describes program points where we are
sure that no behaviors that block the refactoring have occurred. Normal states
keep track of all of the parameter's member fields that are known to be
overwritten on every path from function entry to the corresponding program
point. Failure states accumulate observed violations (unsafe reads and pointer
escapes) that block the refactoring.
*12c85518Srobert
In the partial order of the lattice, failure states compare greater than normal
states, which guarantees that they "win" when joined with normal states. The
order between failure states is determined by the inclusion relation on the set
of accumulated violations (the lattice's `⩽` is `⊆` on the set of violations).
The order between normal states is determined by the reversed inclusion relation
on the set of the parameter's overwritten member fields (the lattice's `⩽` is
`⊇` on the set of overwritten fields).
*12c85518Srobert
*12c85518Srobert![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)
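
The join implied by this ordering can be sketched as follows. The `State`
struct and the representation of fields and violations as strings are
assumptions made for illustration only.

```c++
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical lattice element for the output parameter analysis.
struct State {
  bool is_failure = false;
  std::set<std::string> overwritten_fields;  // Used by normal states only.
  std::set<std::string> violations;          // Used by failure states only.
};

State Join(const State &a, const State &b) {
  State result;
  if (a.is_failure || b.is_failure) {
    // Failure states "win", and violations accumulate by set union.
    result.is_failure = true;
    if (a.is_failure)
      result.violations.insert(a.violations.begin(), a.violations.end());
    if (b.is_failure)
      result.violations.insert(b.violations.begin(), b.violations.end());
    return result;
  }
  // Normal states are ordered by reversed inclusion, so the join keeps only
  // the fields that are overwritten on both incoming paths.
  std::set_intersection(
      a.overwritten_fields.begin(), a.overwritten_fields.end(),
      b.overwritten_fields.begin(), b.overwritten_fields.end(),
      std::inserter(result.overwritten_fields,
                    result.overwritten_fields.end()));
  return result;
}
```

Because normal states are joined by intersection, a field counts as overwritten
after a branch only if every incoming path overwrites it.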
*12c85518Srobert
*12c85518SrobertTo determine whether a statement reads or writes a field we can implement
*12c85518Srobertsymbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer
*12c85518Srobertdereference operator and `MemberExpr`s.
*12c85518Srobert
*12c85518Srobert### Using data flow results to identify output parameters
*12c85518Srobert
*12c85518SrobertLet's take a look at how we use data flow analysis to identify an output
*12c85518Srobertparameter. The refactoring can be safely done when the data flow algorithm
*12c85518Srobertcomputes a normal state with all of the fields proven to be overwritten in the
*12c85518Srobertexit basic block of the function.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertstruct Customer {
*12c85518Srobert  int account_id;
*12c85518Srobert  std::string name;
*12c85518Srobert};
*12c85518Srobert
*12c85518Srobertvoid GetCustomer(Customer* c) {
*12c85518Srobert  // Overwritten: {}
*12c85518Srobert  c->account_id = ...; // Overwritten: {c->account_id}
*12c85518Srobert  if (...) {
*12c85518Srobert    c->name = ...;     // Overwritten: {c->account_id, c->name}
*12c85518Srobert  } else {
*12c85518Srobert    c->name = ...;     // Overwritten: {c->account_id, c->name}
*12c85518Srobert  }
*12c85518Srobert  // Overwritten: {c->account_id, c->name}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
When the data flow algorithm computes a normal state, but not all fields are
proven to be overwritten, we can't perform the refactoring.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid target(bool b, Customer* c) {
*12c85518Srobert  // Overwritten: {}
*12c85518Srobert  if (b) {
*12c85518Srobert    c->account_id = 42;     // Overwritten: {c->account_id}
*12c85518Srobert  } else {
*12c85518Srobert    c->name = "Konrad";  // Overwritten: {c->name}
*12c85518Srobert  }
*12c85518Srobert  // Overwritten: {}
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518SrobertSimilarly, when the data flow algorithm computes a failure state, we also can't
*12c85518Srobertperform the refactoring.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518SrobertCustomer* kGlobalCustomer;
*12c85518Srobert
*12c85518Srobertvoid GetCustomer(Customer* c) {
*12c85518Srobert  // Overwritten: {}
*12c85518Srobert  c->account_id = ...;    // Overwritten: {c->account_id}
*12c85518Srobert  if (...) {
*12c85518Srobert    print(c->name);       // Unsafe read
*12c85518Srobert  } else {
*12c85518Srobert    kGlobalCustomer = c;  // Pointer escape
*12c85518Srobert  }
*12c85518Srobert  // Unsafe read, Pointer escape
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert## Example: finding dead stores
*12c85518Srobert
*12c85518SrobertLet's say we want to find redundant stores, because they indicate potential
*12c85518Srobertbugs.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertx = GetX();
*12c85518Srobertx = GetY();
*12c85518Srobert```
*12c85518Srobert
The first store to `x` is never read, which probably indicates a bug.
*12c85518Srobert
*12c85518SrobertThe implementation of dead store analysis is very similar to output parameter
*12c85518Srobertanalysis: we need to track stores and loads, and find stores that were never
*12c85518Srobertread.
*12c85518Srobert
*12c85518Srobert[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
*12c85518Srobertgeneralization of this idea, which is often used to answer many related
*12c85518Srobertquestions, for example:
*12c85518Srobert
*12c85518Srobert* finding dead stores,
*12c85518Srobert* finding uninitialized variables,
*12c85518Srobert* finding a good point to deallocate memory,
*12c85518Srobert* finding out if it would be safe to move an object.
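
As a small illustration, here is a sketch of liveness-based dead store detection
restricted to a straight-line sequence of statements, where a single backward
pass is enough and no fixpoint iteration is needed. The `Statement` model and
the `FindDeadStores` function are hypothetical, and the sketch does not account
for control flow or pointers.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical statement model: `defined = f(used...)`.
struct Statement {
  std::string defined;
  std::vector<std::string> used;
};

// Returns the indices of statements whose stored value is never read.
// `live` initially holds the variables that are read after the sequence.
std::vector<std::size_t> FindDeadStores(
    const std::vector<Statement> &statements, std::set<std::string> live) {
  std::vector<std::size_t> dead;
  // Walk backwards, maintaining the set of variables live after statement i.
  for (std::size_t i = statements.size(); i-- > 0;) {
    const Statement &statement = statements[i];
    if (live.count(statement.defined) == 0) dead.push_back(i);
    live.erase(statement.defined);  // The store kills the old value.
    live.insert(statement.used.begin(), statement.used.end());
  }
  std::reverse(dead.begin(), dead.end());  // Report in program order.
  return dead;
}
```

For the two stores to `x` above, with `x` read after the sequence, the sketch
reports only the first store as dead.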
*12c85518Srobert
*12c85518Srobert## Example: definitive initialization
*12c85518Srobert
*12c85518SrobertDefinitive initialization proves that variables are known to be initialized when
*12c85518Srobertread. If we find a variable which is read when not initialized then we generate
*12c85518Sroberta warning.
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid Init() {
*12c85518Srobert  int x;    // x is uninitialized
*12c85518Srobert  if (cond()) {
*12c85518Srobert    x = 10; // x is initialized
*12c85518Srobert  } else {
*12c85518Srobert    x = 20; // x is initialized
*12c85518Srobert  }
*12c85518Srobert  print(x); // x is initialized
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
*12c85518Srobert```c++
*12c85518Srobertvoid Uninit() {
*12c85518Srobert  int x;    // x is uninitialized
*12c85518Srobert  if (cond()) {
*12c85518Srobert    x = 10; // x is initialized
*12c85518Srobert  }
*12c85518Srobert  print(x); // x is maybe uninitialized, x is being read, report a bug.
*12c85518Srobert}
*12c85518Srobert```
*12c85518Srobert
For this purpose we can use a lattice in the form of a mapping from variable
declarations to initialization states; each initialization state is represented
by the following lattice:
*12c85518Srobert
*12c85518Srobert![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)
*12c85518Srobert
*12c85518SrobertA lattice element could also capture the source locations of the branches that
*12c85518Srobertlead us to the corresponding program point. Diagnostics would use this
*12c85518Srobertinformation to show a sample buggy code path to the user.
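
A sketch of such a mapping and its join is shown below. It assumes four
initialization states (a bottom element, "initialized", "uninitialized", and
"maybe uninitialized" as the top element); this set of states is an assumption
based on the annotations in the examples above, so the diagram remains the
authoritative description. Variables are keyed by name here only to keep the
example self-contained.

```c++
#include <cassert>
#include <map>
#include <string>

// Assumed initialization states; kMaybeUninitialized acts as the top element.
enum class InitState {
  kBottom,
  kInitialized,
  kUninitialized,
  kMaybeUninitialized
};

InitState Join(InitState a, InitState b) {
  if (a == b) return a;
  if (a == InitState::kBottom) return b;
  if (b == InitState::kBottom) return a;
  return InitState::kMaybeUninitialized;  // The two paths disagree.
}

// Hypothetical map lattice: variable name -> initialization state.
// A variable that is missing from the map is treated as kBottom.
using InitMap = std::map<std::string, InitState>;

// The map lattice is joined pointwise.
InitMap Join(const InitMap &a, const InitMap &b) {
  InitMap result = a;
  for (const auto &entry : b) {
    auto it = result.find(entry.first);
    if (it == result.end())
      result.insert(entry);
    else
      it->second = Join(it->second, entry.second);
  }
  return result;
}
```

In `Uninit()`, joining the state after the "then" branch (`x` is initialized)
with the state of the path that skips it (`x` is uninitialized) yields "maybe
uninitialized", which is what triggers the report at `print(x)`.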

## Example: refactoring raw pointers to `unique_ptr`

Modern idiomatic C++ uses smart pointers to express memory ownership; however,
in pre-C++11 code one can often find raw pointers that own heap memory blocks.

Imagine that we would like to refactor raw pointers that own memory to
`unique_ptr`. There are multiple ways to design a data flow analysis for this
problem; let's look at one way to do it.

For example, we would like to refactor the following code that uses raw
pointers:

```c++
void UniqueOwnership1() {
  int *pi = new int;
  if (...) {
    Borrow(pi);
    delete pi;
  } else {
    TakeOwnership(pi);
  }
}
```

into code that uses `unique_ptr`:

```c++
void UniqueOwnership1() {
  auto pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    TakeOwnership(pi.release());
  }
}
```

This problem can be solved with a lattice in the form of a map from value
declarations to pointer states:

![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)

We can perform the refactoring if at the exit of a function `pi` is
`Compatible`.

```c++
void UniqueOwnership1() {
  int *pi;             // pi is Compatible
  pi = new int;        // pi is Defined
  if (...) {
    Borrow(pi);        // pi is Defined
    delete pi;         // pi is Compatible
  } else {
    TakeOwnership(pi); // pi is Compatible
  }
  // pi is Compatible
}
```
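The state changes annotated above can be sketched as two small transfer functions plus a join. This is only one possible reading of the example: the total ordering `Compatible < Defined < Conflicting`, the function names `OnAssign` and `OnSink`, and the treatment of a sink without a preceding assignment are all assumptions of this sketch.

```c++
#include <algorithm>
#include <cassert>

// The three pointer states used in the annotated examples. The ordering
// Compatible < Defined < Conflicting is an assumption of this sketch.
enum class PtrState { Compatible = 0, Defined = 1, Conflicting = 2 };

// Join at a control-flow merge: for a totally ordered lattice it is the maximum.
PtrState Join(PtrState a, PtrState b) { return std::max(a, b); }

// The pointer receives a new value (for example, `pi = new int`).
// A second assignment without a sink in between is a conflict.
PtrState OnAssign(PtrState s) {
  return s == PtrState::Compatible ? PtrState::Defined : PtrState::Conflicting;
}

// The value is consumed by a sink (`delete pi`, or a call that takes ownership).
// Treating a sink that has no matching assignment as a conflict is an assumption.
PtrState OnSink(PtrState s) {
  return s == PtrState::Defined ? PtrState::Compatible : PtrState::Conflicting;
}

// Replays the annotated `UniqueOwnership1` by hand. Borrowing does not
// change the state, so it has no transfer function here.
void CheckUniqueOwnership1() {
  PtrState pi = PtrState::Compatible;  // int *pi;
  pi = OnAssign(pi);                   // pi = new int;
  PtrState then_branch = OnSink(pi);   // Borrow(pi); delete pi;
  PtrState else_branch = OnSink(pi);   // TakeOwnership(pi);
  assert(Join(then_branch, else_branch) == PtrState::Compatible);
}
```

Because every assignment is paired with exactly one sink on each path, the state at the function exit is `Compatible`, which is the condition for performing the refactoring.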

Let's look at an example where the raw pointer owns two different memory blocks:

```c++
void UniqueOwnership2() {
  int *pi = new int;  // pi is Defined
  Borrow(pi);
  delete pi;          // pi is Compatible
  if (smth) {
    pi = new int;     // pi is Defined
    Borrow(pi);
    delete pi;        // pi is Compatible
  }
  // pi is Compatible
}
```

It can be refactored to use `unique_ptr` like this:

```c++
void UniqueOwnership2() {
  auto pi = std::make_unique<int>();
  Borrow(pi.get());
  if (smth) {
    pi = std::make_unique<int>();
    Borrow(pi.get());
  }
}
```

In the following example, the raw pointer is used to access the heap object
after the ownership has been transferred.

```c++
void UniqueOwnership3() {
  int *pi = new int; // pi is Defined
  if (...) {
    Borrow(pi);
    delete pi;       // pi is Compatible
  } else {
    std::vector<std::unique_ptr<int>> v;
    v.push_back(std::unique_ptr<int>(pi)); // pi is Compatible
    print(*pi);
    use(v);
  }
  // pi is Compatible
}
```

We can refactor this code to use `unique_ptr`; however, we would have to
introduce a non-owning pointer variable, since we can't use the moved-from
`unique_ptr` to access the object:

```c++
void UniqueOwnership3() {
  std::unique_ptr<int> pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    int *pi_non_owning = pi.get();
    std::vector<std::unique_ptr<int>> v;
    v.push_back(std::move(pi));
    print(*pi_non_owning);
    use(v);
  }
}
```

If the original code didn't call `delete` at the very end of the function, then
our refactoring may change the point at which we run the destructor and release
memory. Specifically, if there is some user code after `delete`, then extending
the lifetime of the object until the end of the function may hold locks for
longer than necessary, introduce memory overhead, and so on.

One solution is to always replace `delete` with a call to `reset()`, and then
perform another analysis that removes unnecessary `reset()` calls.

```c++
void AddedMemoryOverhead() {
  HugeObject *ho = new HugeObject();
  use(ho);
  delete ho; // Release the large amount of memory quickly.
  LongRunningFunction();
}
```
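For illustration, the `reset()`-based refactoring of this function might look like the sketch below. `HugeObject`, `use`, and `LongRunningFunction` are hypothetical stand-ins defined only so the sketch is self-contained; the `live_count` counter exists purely to make the destruction point observable.

```c++
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the types and functions in the example above.
struct HugeObject {
  static inline int live_count = 0;  // Number of objects currently alive.
  HugeObject() { ++live_count; }
  ~HugeObject() { --live_count; }
};

void use(HugeObject *) {}

void LongRunningFunction() {
  // As in the original code, the object is already destroyed at this point.
  assert(HugeObject::live_count == 0);
}

void AddedMemoryOverhead() {
  auto ho = std::make_unique<HugeObject>();
  use(ho.get());
  ho.reset(); // Replaces `delete ho`: the memory is released at the same point.
  LongRunningFunction();
}
```

Here the `reset()` call has to stay, because removing it would keep the object alive during `LongRunningFunction()`. A `reset()` immediately before the end of the function would be a candidate for removal by the follow-up analysis.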

This analysis will refuse to refactor code that mixes borrowed pointer values
and unique ownership. In the following code, `GetPtr()` returns a borrowed
pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
pointer. We don't distinguish between these two assignments, and we want each
assignment to be paired with a corresponding sink; otherwise, we transition the
pointer to a `Conflicting` state, as in this example.

```c++
void ConflictingOwnership() {
  int *pi;           // pi is Compatible
  pi = GetPtr();     // pi is Defined
  Borrow(pi);        // pi is Defined

  pi = new int;      // pi is Conflicting
  Borrow(pi);
  delete pi;
  // pi is Conflicting
}
```

We could still handle this case by finding a maximal range in the code where
`pi` could be in the `Compatible` state, and only refactoring that part.

```c++
void ConflictingOwnership() {
  int *pi;
  pi = GetPtr();
  Borrow(pi);

  std::unique_ptr<int> pi_unique = std::make_unique<int>();
  Borrow(pi_unique.get());
}
```

## Example: finding redundant branch conditions

In the code below `b1` should not be checked in both the outer and inner "if"
statements. It is likely there is a bug in this code.

```c++
void F(bool b1, bool b2) {
  if (b1) {
    f();
    if (b1 && b2) {  // Check `b1` again -- unnecessary!
      g();
    }
  }
}
```

A checker that finds this pattern syntactically is already implemented in
ClangTidy using AST matchers (`bugprone-redundant-branch-condition`).

To implement it using the data flow analysis framework, we can produce a warning
if any part of the branch condition is implied by the flow condition.

```c++
void F(bool b1, bool b2) {
  // Flow condition: true.
  if (b1) {
    // Flow condition: b1.
    f();
    if (b1 && b2) { // `b1` is implied by the flow condition.
      g();
    }
  }
}
```

One way to check this implication is to use a SAT solver. Without a SAT solver,
we could keep the flow condition in conjunctive normal form (CNF), which makes
simple implications easy to check: for example, any condition that already
appears as one of the conjuncts is implied.
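A minimal sketch of the solver-free approach, assuming the flow condition is restricted to a conjunction of named atoms (a special case of CNF in which every clause has a single literal). The representation and the function names are hypothetical, and the check is deliberately conservative: it only recognizes conditions that literally appear in the flow condition.

```c++
#include <cassert>
#include <set>
#include <string>
#include <vector>

// A flow condition restricted to a conjunction of atoms, e.g. {"b1", "b2"}.
// The empty set represents the flow condition `true`.
using FlowCondition = std::set<std::string>;

// Entering the "then" branch of `if (c1 && c2 && ...)` adds every operand
// of the condition to the conjunction.
FlowCondition EnterThenBranch(FlowCondition fc,
                              const std::vector<std::string> &operands) {
  fc.insert(operands.begin(), operands.end());
  return fc;
}

// A conjunction implies each of its own conjuncts, so membership suffices.
bool Implies(const FlowCondition &fc, const std::string &atom) {
  return fc.count(atom) > 0;
}

// Returns the operands of a `&&` branch condition that are already implied
// by the flow condition and are therefore redundant.
std::vector<std::string> RedundantOperands(
    const FlowCondition &fc, const std::vector<std::string> &operands) {
  std::vector<std::string> redundant;
  for (const std::string &atom : operands)
    if (Implies(fc, atom)) redundant.push_back(atom);
  return redundant;
}

// Models the function `F` above: the outer `if (b1)`, then `if (b1 && b2)`.
void CheckRedundantBranchExample() {
  FlowCondition outer = EnterThenBranch(FlowCondition{}, {"b1"});
  std::vector<std::string> redundant = RedundantOperands(outer, {"b1", "b2"});
  assert(redundant.size() == 1 && redundant[0] == "b1");
}
```

A real checker would also have to drop an atom from the flow condition when one of its variables is modified, which this sketch does not model.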

## Example: finding unchecked `std::optional` unwraps

Calling `optional::value()` is only valid if `optional::has_value()` is true. We
want to show that when `x.value()` is executed, the flow condition implies
`x.has_value()`.

In the example below `x.value()` is accessed safely because it is guarded by the
`x.has_value()` check.

```c++
void Example(std::optional<int> &x) {
  if (x.has_value()) {
    use(x.value());
  }
}
```

While entering the if branch we deduce that `x.has_value()` is implied by the
flow condition.

```c++
void Example(std::optional<int> &x) {
  // Flow condition: true.
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    use(x.value());
  }
  // Flow condition: true.
}
```

We also need to prove that `x` is not modified between the check and the value
access. The modification of `x` may be very subtle:

```c++
void unknown_function(std::optional<int> &x);

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    unknown_function(x); // may change x.
    // Flow condition: true.
    use(x.value());
  }
}
```
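The bookkeeping in the examples above could be sketched roughly as follows. `OptionalState` and its member functions are hypothetical names for this sketch, and optionals are identified by variable name for simplicity; it is not how any particular checker is implemented.

```c++
#include <cassert>
#include <set>
#include <string>

// Hypothetical analysis state: the optionals for which the flow condition
// currently implies `has_value()`.
struct OptionalState {
  std::set<std::string> known_engaged;

  // Entering the "then" branch of `if (x.has_value())`.
  void OnHasValueCheck(const std::string &var) { known_engaged.insert(var); }

  // `var` is passed by non-const reference to a function we cannot see
  // into, so we must forget what we knew about it.
  void OnPassedToUnknownFunction(const std::string &var) {
    known_engaged.erase(var);
  }

  // `var.value()` is safe only if the flow condition still implies
  // `var.has_value()`.
  bool IsValueAccessSafe(const std::string &var) const {
    return known_engaged.count(var) > 0;
  }
};

// Replays the last example above: check, unknown call, then access.
void CheckInvalidationExample() {
  OptionalState state;
  state.OnHasValueCheck("x");
  assert(state.IsValueAccessSafe("x"));
  state.OnPassedToUnknownFunction("x");
  assert(!state.IsValueAccessSafe("x"));  // A warning would be reported here.
}
```

Forgetting the fact on an unknown call is the conservative choice: it may produce a warning for code that is actually safe, but it never claims safety that it cannot prove.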

## Example: finding dead code behind A/B experiment flags

Finding dead code is a classic application of data flow analysis.

Unused A/B experiment flags hide dead code. However, this flavor of dead code is
invisible to the compiler because the flag can be turned on at any moment.

We could make a tool that deletes experiment flags. The user tells us which flag
they want to delete, and we assume that its value is a given constant.

For example, the user could use the tool to remove `example_flag` from this
code:

```c++
DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");

void Example() {
  bool x = GetFlag(FLAGS_example_flag).empty();
  f();
  if (x) {
    g();
  } else {
    h();
  }
}
```

The tool would simplify the code to:

```c++
void Example() {
  f();
  g();
}
```

We can solve this problem with a classic constant propagation lattice combined
with symbolic evaluation.
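As a sketch of the lattice side of this idea, a flat constant propagation lattice for a single boolean could look like the code below. The type and function names are hypothetical, and the symbolic evaluation that turns `GetFlag(FLAGS_example_flag).empty()` into a constant is assumed to have happened already.

```c++
#include <cassert>

// Flat constant propagation lattice for a boolean:
// Bottom (no value yet) < {true, false} < Top (not a constant).
struct ConstBool {
  enum Kind { Bottom, Constant, Top } kind = Bottom;
  bool value = false;  // Meaningful only when kind == Constant.
};

ConstBool MakeConstant(bool v) { return ConstBool{ConstBool::Constant, v}; }

// Join at a control-flow merge: equal constants survive, anything else
// that disagrees goes to Top.
ConstBool Join(ConstBool a, ConstBool b) {
  if (a.kind == ConstBool::Bottom) return b;
  if (b.kind == ConstBool::Bottom) return a;
  if (a.kind == ConstBool::Constant && b.kind == ConstBool::Constant &&
      a.value == b.value)
    return a;
  return ConstBool{ConstBool::Top, false};
}

enum class Branches { ThenOnly, ElseOnly, Both };

// Decides which branches of `if (cond)` can execute. Anything that is not
// a known constant conservatively keeps both branches.
Branches LiveBranches(ConstBool cond) {
  if (cond.kind != ConstBool::Constant) return Branches::Both;
  return cond.value ? Branches::ThenOnly : Branches::ElseOnly;
}

// Models `Example()` above with the flag assumed to be the empty string,
// so `x` is the constant `true` and only the call to `g()` remains.
void CheckFlagExample() {
  ConstBool x = MakeConstant(true);
  assert(LiveBranches(x) == Branches::ThenOnly);
}
```

Once `LiveBranches` reports a single live branch, the tool can replace the whole `if` statement with the body of that branch.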

## Example: finding inefficient usages of associative containers

Real-world code often accidentally performs repeated lookups in associative
containers:

```c++
std::map<int, Employee> xs;
xs[42].name = "...";
xs[42].title = "...";
```

To find the above inefficiency we can use the available expressions analysis to
understand that `xs[42]` is evaluated twice. The lookup can then be performed
once:

```c++
std::map<int, Employee> xs;
Employee &e = xs[42];
e.name = "...";
e.title = "...";
```
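A very reduced sketch of how an available expressions analysis could flag the repeated lookup, assuming straight-line code that has already been lowered to a list of lookup and invalidation steps. The `Step` type and `FindRepeatedLookups` are hypothetical, and expressions are compared by their text, which a real analysis would not do.

```c++
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One step of straight-line code, reduced to what this sketch needs.
struct Step {
  enum Kind { Lookup, Invalidate } kind;
  // For Lookup: the expression text, e.g. "xs[42]".
  // For Invalidate: the name of the modified container, e.g. "xs".
  std::string text;
};

// Returns the indices of steps that re-evaluate a lookup whose result is
// still available from an earlier step.
std::vector<int> FindRepeatedLookups(const std::vector<Step> &steps) {
  std::set<std::string> available;
  std::vector<int> repeated;
  for (int i = 0; i < static_cast<int>(steps.size()); ++i) {
    const Step &step = steps[i];
    if (step.kind == Step::Lookup) {
      // insert() reports whether the expression was already available.
      if (!available.insert(step.text).second) repeated.push_back(i);
      continue;
    }
    // Modifying the container kills every lookup expression on it.
    const std::string prefix = step.text + "[";
    for (auto it = available.begin(); it != available.end();) {
      if (it->compare(0, prefix.size(), prefix) == 0)
        it = available.erase(it);
      else
        ++it;
    }
  }
  return repeated;
}

// Models the two `xs[42]` lookups in the example above.
void CheckRepeatedLookupExample() {
  std::vector<Step> steps = {{Step::Lookup, "xs[42]"}, {Step::Lookup, "xs[42]"}};
  std::vector<int> repeated = FindRepeatedLookups(steps);
  assert(repeated.size() == 1 && repeated[0] == 1);
}
```

The invalidation step matters: if the container is modified between the two lookups, the second one is no longer redundant and must not be reported.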

We can also track the `xs.contains()` check in the flow condition to find
redundant checks, as in the example below.

```c++
std::map<int, Employee> xs;
if (!xs.contains(42)) {
  xs.insert({42, someEmployee});
}
```

## Example: refactoring types that implicitly convert to each other

Refactoring one strong type to another is difficult, but the compiler can help:
once you refactor one reference to the type, the compiler will flag other places
where this information flows with type mismatch errors. Unfortunately this
strategy does not work when you are refactoring types that implicitly convert to
each other, for example, replacing `int32_t` with `int64_t`.

Imagine that we want to change user IDs from 32-bit to 64-bit integers. In other
words, we need to find all integers tainted with user IDs. We can use data flow
analysis to implement taint analysis.

```c++
void UseUser(int32_t user_id) {
  int32_t id = user_id;
  // Variable `id` is tainted with a user ID.
  ...
}
```

Taint analysis is very well suited to this problem because the program rarely
branches on user IDs, and almost certainly does not perform any computation
(like arithmetic) on them.
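A minimal, flow-insensitive sketch of the taint propagation, assuming the function body has been reduced to a list of `dst = src` assignments between integer variables. The `Assignment` alias and `TaintedVariables` are hypothetical names, and a real analysis would also follow calls, returns, and fields.

```c++
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// An assignment `dst = src` between variables, identified by name.
using Assignment = std::pair<std::string, std::string>;

// Propagates taint from the seed variables through assignments until
// nothing changes. The order of the assignments is ignored.
std::set<std::string> TaintedVariables(
    const std::set<std::string> &seeds,
    const std::vector<Assignment> &assignments) {
  std::set<std::string> tainted = seeds;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &[dst, src] : assignments) {
      // insert() reports whether `dst` was newly tainted in this pass.
      if (tainted.count(src) > 0 && tainted.insert(dst).second)
        changed = true;
    }
  }
  return tainted;
}

// Models `UseUser` above, where `id` is initialized from `user_id`.
void CheckTaintExample() {
  std::set<std::string> seeds = {"user_id"};
  std::vector<Assignment> assignments = {{"id", "user_id"}};
  std::set<std::string> tainted = TaintedVariables(seeds, assignments);
  assert(tainted.count("id") == 1);
}
```

Every variable in the resulting set is a candidate for the `int32_t` to `int64_t` change; variables outside the set keep their type.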