1*12c85518Srobert# Data flow analysis: an informal introduction 2*12c85518Srobert 3*12c85518Srobert## Abstract 4*12c85518Srobert 5*12c85518SrobertThis document introduces data flow analysis in an informal way. The goal is to 6*12c85518Srobertgive the reader an intuitive understanding of how it works, and show how it 7*12c85518Srobertapplies to a range of refactoring and bug finding problems. 8*12c85518Srobert 9*12c85518SrobertData flow analysis is a well-established technique; it is described in many 10*12c85518Srobertpapers, books, and videos. If you would like a more formal, or a more thorough 11*12c85518Srobertexplanation of the concepts mentioned in this document, please refer to the 12*12c85518Srobertfollowing resources: 13*12c85518Srobert 14*12c85518Srobert* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)). 15*12c85518Srobert* Videos on the PacketPrep YouTube channel that introduce lattices and the 16*12c85518Srobert necessary background information: 17*12c85518Srobert [#20](https://www.youtube.com/watch?v=73j_FXBXGm8), 18*12c85518Srobert [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8), 19*12c85518Srobert [#22](https://www.youtube.com/watch?v=saOG7Uooeho), 20*12c85518Srobert [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g), 21*12c85518Srobert [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc), 22*12c85518Srobert [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw). 23*12c85518Srobert* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE) 24*12c85518Srobert* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf). 25*12c85518Srobert* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf). 26*12c85518Srobert* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/). 27*12c85518Srobert* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf) 28*12c85518Srobert (a paper that successfully applies symbolic execution to real-world 29*12c85518Srobert software). 30*12c85518Srobert 31*12c85518Srobert## Data flow analysis 32*12c85518Srobert 33*12c85518Srobert### The purpose of data flow analysis 34*12c85518Srobert 35*12c85518SrobertData flow analysis is a static analysis technique that proves facts about a 36*12c85518Srobertprogram or its fragment. It can make conclusions about all paths through the 37*12c85518Srobertprogram, while taking control flow into account and scaling to large programs. 38*12c85518SrobertThe basic idea is propagating facts about the program through the edges of the 39*12c85518Srobertcontrol flow graph (CFG) until a fixpoint is reached. 40*12c85518Srobert 41*12c85518Srobert### Sample problem and an ad-hoc solution 42*12c85518Srobert 43*12c85518SrobertWe would like to explain data flow analysis while discussing an example. Let's 44*12c85518Srobertimagine that we want to track possible values of an integer variable in our 45*12c85518Srobertprogram. Here is how a human could annotate the code: 46*12c85518Srobert 47*12c85518Srobert```c++ 48*12c85518Srobertvoid Example(int n) { 49*12c85518Srobert int x = 0; 50*12c85518Srobert // x is {0} 51*12c85518Srobert if (n > 0) { 52*12c85518Srobert x = 5; 53*12c85518Srobert // x is {5} 54*12c85518Srobert } else { 55*12c85518Srobert x = 42; 56*12c85518Srobert // x is {42} 57*12c85518Srobert } 58*12c85518Srobert // x is {5; 42} 59*12c85518Srobert print(x); 60*12c85518Srobert} 61*12c85518Srobert``` 62*12c85518Srobert 63*12c85518SrobertWe use sets of integers to represent possible values of `x`. Local variables 64*12c85518Sroberthave unambiguous values between statements, so we annotate program points 65*12c85518Srobertbetween statements with sets of possible values. 66*12c85518Srobert 67*12c85518SrobertHere is how we arrived at these annotations. Assigning a constant to `x` allows 68*12c85518Srobertus to make a conclusion that `x` can only have one value. When control flow from 69*12c85518Srobertthe "then" and "else" branches joins, `x` can have either value. 70*12c85518Srobert 71*12c85518SrobertAbstract algebra provides a nice formalism that models this kind of structure, 72*12c85518Srobertnamely, a lattice. A join-semilattice is a partially ordered set, in which every 73*12c85518Sroberttwo elements have a least upper bound (called a *join*). 74*12c85518Srobert 75*12c85518Srobert``` 76*12c85518Srobertjoin(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x 77*12c85518Srobert``` 78*12c85518Srobert 79*12c85518SrobertFor this problem we will use the lattice of subsets of integers, with set 80*12c85518Srobertinclusion relation as ordering and set union as a join. 81*12c85518Srobert 82*12c85518SrobertLattices are often represented visually as Hasse diagrams. Here is a Hasse 83*12c85518Srobertdiagram for our lattice that tracks subsets of integers: 84*12c85518Srobert 85*12c85518Srobert 86*12c85518Srobert 87*12c85518SrobertComputing the join in the lattice corresponds to finding the lowest common 88*12c85518Srobertancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of 89*12c85518Srobertliterature on efficiently implementing LCA queries for a DAG, however Efficient 90*12c85518SrobertImplementation of Lattice Operations (1989) 91*12c85518Srobert([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911), 92*12c85518Srobert[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that 93*12c85518Srobertparticularly well-suited for programmatic implementation. 94*12c85518Srobert 95*12c85518Srobert### Too much information and "top" values 96*12c85518Srobert 97*12c85518SrobertLet's try to find the possible sets of values of `x` in a function that modifies 98*12c85518Srobert`x` in a loop: 99*12c85518Srobert 100*12c85518Srobert```c++ 101*12c85518Srobertvoid ExampleOfInfiniteSets() { 102*12c85518Srobert int x = 0; // x is {0} 103*12c85518Srobert while (condition()) { 104*12c85518Srobert x += 1; // x is {0; 1; 2; …} 105*12c85518Srobert } 106*12c85518Srobert print(x); // x is {0; 1; 2; …} 107*12c85518Srobert} 108*12c85518Srobert``` 109*12c85518Srobert 110*12c85518SrobertWe have an issue: `x` can have any value greater than zero; that's an infinite 111*12c85518Srobertset of values, if the program operated on mathematical integers. In C++ `int` is 112*12c85518Srobertlimited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is 113*12c85518Srobertstill really big. 114*12c85518Srobert 115*12c85518SrobertTo make our analysis practical to compute, we have to limit the amount of 116*12c85518Srobertinformation that we track. In this case, we can, for example, arbitrarily limit 117*12c85518Srobertthe size of sets to 3 elements. If at a certain program point `x` has more than 118*12c85518Srobert3 possible values, we stop tracking specific values at that program point. 119*12c85518SrobertInstead, we denote possible values of `x` with the symbol `⊤` (pronounced "top" 120*12c85518Srobertaccording to a convention in abstract algebra). 121*12c85518Srobert 122*12c85518Srobert```c++ 123*12c85518Srobertvoid ExampleOfTopWithALoop() { 124*12c85518Srobert int x = 0; // x is {0} 125*12c85518Srobert while (condition()) { 126*12c85518Srobert x += 1; // x is ⊤ 127*12c85518Srobert } 128*12c85518Srobert print(x); // x is ⊤ 129*12c85518Srobert} 130*12c85518Srobert``` 131*12c85518Srobert 132*12c85518SrobertThe statement "at this program point, `x`'s possible values are `⊤`" is 133*12c85518Srobertunderstood as "at this program point `x` can have any value because we have too 134*12c85518Srobertmuch information, or the information is conflicting". 135*12c85518Srobert 136*12c85518SrobertNote that we can get more than 3 possible values even without a loop: 137*12c85518Srobert 138*12c85518Srobert```c++ 139*12c85518Srobertvoid ExampleOfTopWithoutLoops(int n) { 140*12c85518Srobert int x = 0; // x is {0} 141*12c85518Srobert switch(n) { 142*12c85518Srobert case 0: x = 1; break; // x is {1} 143*12c85518Srobert case 1: x = 9; break; // x is {9} 144*12c85518Srobert case 2: x = 7; break; // x is {7} 145*12c85518Srobert default: x = 3; break; // x is {3} 146*12c85518Srobert } 147*12c85518Srobert // x is ⊤ 148*12c85518Srobert} 149*12c85518Srobert``` 150*12c85518Srobert 151*12c85518Srobert### Uninitialized variables and "bottom" values 152*12c85518Srobert 153*12c85518SrobertWhen `x` is declared but not initialized, it has no possible values. We 154*12c85518Srobertrepresent this fact symbolically as `⊥` (pronounced "bottom"). 155*12c85518Srobert 156*12c85518Srobert```c++ 157*12c85518Srobertvoid ExampleOfBottom() { 158*12c85518Srobert int x; // x is ⊥ 159*12c85518Srobert x = 42; // x is {42} 160*12c85518Srobert print(x); 161*12c85518Srobert} 162*12c85518Srobert``` 163*12c85518Srobert 164*12c85518SrobertNote that using values read from uninitialized variables is undefined behaviour 165*12c85518Srobertin C++. Generally, compilers and static analysis tools can assume undefined 166*12c85518Srobertbehavior does not happen. We must model uninitialized variables only when we are 167*12c85518Srobertimplementing a checker that specifically is trying to find uninitialized reads. 168*12c85518SrobertIn this example we show how to model uninitialized variables only to demonstrate 169*12c85518Srobertthe concept of "bottom", and how it applies to possible value analysis. We 170*12c85518Srobertdescribe an analysis that finds uninitialized reads in a section below. 171*12c85518Srobert 172*12c85518Srobert### A practical lattice that tracks sets of concrete values 173*12c85518Srobert 174*12c85518SrobertTaking into account all corner cases covered above, we can put together a 175*12c85518Srobertlattice that we can use in practice to track possible values of integer 176*12c85518Srobertvariables. This lattice represents sets of integers with 1, 2, or 3 elements, as 177*12c85518Srobertwell as top and bottom. Here is a Hasse diagram for it: 178*12c85518Srobert 179*12c85518Srobert 180*12c85518Srobert 181*12c85518Srobert### Formalization 182*12c85518Srobert 183*12c85518SrobertLet's consider a slightly more complex example, and think about how we can 184*12c85518Srobertcompute the sets of possible values algorithmically. 185*12c85518Srobert 186*12c85518Srobert```c++ 187*12c85518Srobertvoid Example(int n) { 188*12c85518Srobert int x; // x is ⊥ 189*12c85518Srobert if (n > 0) { 190*12c85518Srobert if (n == 42) { 191*12c85518Srobert x = 44; // x is {44} 192*12c85518Srobert } else { 193*12c85518Srobert x = 5; // x is {5} 194*12c85518Srobert } 195*12c85518Srobert print(x); // x is {44; 5} 196*12c85518Srobert } else { 197*12c85518Srobert x = n; // x is ⊤ 198*12c85518Srobert } 199*12c85518Srobert print(x); // x is ⊤ 200*12c85518Srobert} 201*12c85518Srobert``` 202*12c85518Srobert 203*12c85518SrobertAs humans, we understand the control flow from the program text. We used our 204*12c85518Srobertunderstanding of control flow to find program points where two flows join. 205*12c85518SrobertFormally, control flow is represented by a CFG (control flow graph): 206*12c85518Srobert 207*12c85518Srobert 208*12c85518Srobert 209*12c85518SrobertWe can compute sets of possible values by propagating them through the CFG of 210*12c85518Srobertthe function: 211*12c85518Srobert 212*12c85518Srobert* When `x` is declared but not initialized, its possible values are `{}`. The 213*12c85518Srobert empty set plays the role of `⊥` in this lattice. 214*12c85518Srobert 215*12c85518Srobert* When `x` is assigned a concrete value, its possible set of values contains 216*12c85518Srobert just that specific value. 217*12c85518Srobert 218*12c85518Srobert* When `x` is assigned some unknown value, it can have any value. We represent 219*12c85518Srobert this fact as `⊤`. 220*12c85518Srobert 221*12c85518Srobert* When two control flow paths join, we compute the set union of incoming 222*12c85518Srobert values (limiting the number of elements to 3, representig larger sets as 223*12c85518Srobert `⊤`). 224*12c85518Srobert 225*12c85518SrobertThe sets of possible values are influenced by: 226*12c85518Srobert 227*12c85518Srobert* Statements, for example, assignments. 228*12c85518Srobert 229*12c85518Srobert* Joins in control flow, for example, ones that appear at the end of "if" 230*12c85518Srobert statements. 231*12c85518Srobert 232*12c85518Srobert**Effects of statements** are modeled by what is formally known as a transfer 233*12c85518Srobertfunction. A transfer function takes two arguments: the statement, and the state 234*12c85518Srobertof `x` at the previous program point. It produces the state of `x` at the next 235*12c85518Srobertprogram point. For example, the transfer function for assignment ignores the 236*12c85518Srobertstate at the previous program point: 237*12c85518Srobert 238*12c85518Srobert```c++ 239*12c85518Srobert// GIVEN: x is {42; 44} 240*12c85518Srobertx = 0; 241*12c85518Srobert// CONCLUSION: x is {0} 242*12c85518Srobert``` 243*12c85518Srobert 244*12c85518SrobertThe transfer function for `+` performs arithmetic on every set member: 245*12c85518Srobert 246*12c85518Srobert```c++ 247*12c85518Srobert// GIVEN: x is {42, 44} 248*12c85518Srobertx = x + 100; 249*12c85518Srobert// CONCLUSION: x is {142, 144} 250*12c85518Srobert``` 251*12c85518Srobert 252*12c85518Srobert**Effects of control flow** are modeled by joining the knowledge from all 253*12c85518Srobertpossible previous program points. 254*12c85518Srobert 255*12c85518Srobert```c++ 256*12c85518Srobertif (...) { 257*12c85518Srobert ... 258*12c85518Srobert // GIVEN: x is {42} 259*12c85518Srobert} else { 260*12c85518Srobert ... 261*12c85518Srobert // GIVEN: x is {44} 262*12c85518Srobert} 263*12c85518Srobert// CONCLUSION: x is {42; 44} 264*12c85518Srobert``` 265*12c85518Srobert 266*12c85518Srobert```c++ 267*12c85518Srobert// GIVEN: x is {42} 268*12c85518Srobertwhile (...) { 269*12c85518Srobert ... 270*12c85518Srobert // GIVEN: x is {44} 271*12c85518Srobert} 272*12c85518Srobert// CONCLUSION: {42; 44} 273*12c85518Srobert``` 274*12c85518Srobert 275*12c85518SrobertThe predicate that we marked "given" is usually called a precondition, and the 276*12c85518Srobertconclusion is called a postcondition. 277*12c85518Srobert 278*12c85518SrobertIn terms of the CFG, we join the information from all predecessor basic blocks. 279*12c85518Srobert 280*12c85518Srobert 281*12c85518Srobert 282*12c85518SrobertPutting it all together, to model the effects of a basic block we compute: 283*12c85518Srobert 284*12c85518Srobert``` 285*12c85518Srobertout = transfer(basic_block, join(in_1, in_2, ..., in_n)) 286*12c85518Srobert``` 287*12c85518Srobert 288*12c85518Srobert(Note that there are other ways to write this equation that produce higher 289*12c85518Srobertprecision analysis results. The trick is to keep exploring the execution paths 290*12c85518Srobertseparately and delay joining until later. However, we won't discuss those 291*12c85518Srobertvariations here.) 292*12c85518Srobert 293*12c85518SrobertTo make a conclusion about all paths through the program, we repeat this 294*12c85518Srobertcomputation on all basic blocks until we reach a fixpoint. In other words, we 295*12c85518Srobertkeep propagating information through the CFG until the computed sets of values 296*12c85518Srobertstop changing. 297*12c85518Srobert 298*12c85518SrobertIf the lattice has a finite height and transfer functions are monotonic the 299*12c85518Srobertalgorithm is guaranteed to terminate. Each iteration of the algorithm can 300*12c85518Srobertchange computed values only to larger values from the lattice. In the worst 301*12c85518Srobertcase, all computed values become `⊤`, which is not very useful, but at least the 302*12c85518Srobertanalysis terminates at that point, because it can't change any of the values. 303*12c85518Srobert 304*12c85518SrobertFixpoint iteration can be optimised by only reprocessing basic blocks which had 305*12c85518Srobertone of their inputs changed on the previous iteration. This is typically 306*12c85518Srobertimplemented using a worklist queue. With this optimisation the time complexity 307*12c85518Srobertbecomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and 308*12c85518Srobert`|L|` is the size of lattice used by the analysis. 309*12c85518Srobert 310*12c85518Srobert## Symbolic execution: a very short informal introduction 311*12c85518Srobert 312*12c85518Srobert### Symbolic values 313*12c85518Srobert 314*12c85518SrobertIn the previous example where we tried to figure out what values a variable can 315*12c85518Sroberthave, the analysis had to be seeded with a concrete value. What if there are no 316*12c85518Srobertassignments of concrete values in the program? We can still deduce some 317*12c85518Srobertinteresting information by representing unknown input values symbolically, and 318*12c85518Srobertcomputing results as symbolic expressions: 319*12c85518Srobert 320*12c85518Srobert```c++ 321*12c85518Srobertvoid PrintAbs(int x) { 322*12c85518Srobert int result; 323*12c85518Srobert if (x >= 0) { 324*12c85518Srobert result = x; // result is {x} 325*12c85518Srobert } else { 326*12c85518Srobert result = -x; // result is {-x} 327*12c85518Srobert } 328*12c85518Srobert print(result); // result is {x; -x} 329*12c85518Srobert} 330*12c85518Srobert``` 331*12c85518Srobert 332*12c85518SrobertWe can't say what specific value gets printed, but we know that it is either `x` 333*12c85518Srobertor `-x`. 334*12c85518Srobert 335*12c85518SrobertDataflow analysis is an istance of abstract interpretation, and does not dictate 336*12c85518Sroberthow exactly the lattice and transfer functions should be designed, beyond the 337*12c85518Srobertnecessary conditions for the analysis to converge. Nevertheless, we can use 338*12c85518Srobertsymbolic execution ideas to guide our design of the lattice and transfer 339*12c85518Srobertfunctions: lattice values can be symbolic expressions, and transfer functions 340*12c85518Srobertcan construct more complex symbolic expressions from symbolic expressions that 341*12c85518Srobertrepresent arguments. See [this StackOverflow 342*12c85518Srobertdiscussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation) 343*12c85518Srobertfor a further comparison of abstract interpretation and symbolic execution. 344*12c85518Srobert 345*12c85518Srobert### Flow condition 346*12c85518Srobert 347*12c85518SrobertA human can say about the previous example that the function returns `x` when 348*12c85518Srobert`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by 349*12c85518Sroberttracking a flow condition. A flow condition is a predicate written in terms of 350*12c85518Srobertthe program state that is true at a specific program point regardless of the 351*12c85518Srobertexecution path that led to this statement. For example, the flow condition for 352*12c85518Srobertthe program point right before evaluating `result = x` is `x >= 0`. 353*12c85518Srobert 354*12c85518SrobertIf we enhance the lattice to be a set of pairs of values and predicates, the 355*12c85518Srobertdataflow analysis computes the following values: 356*12c85518Srobert 357*12c85518Srobert```c++ 358*12c85518Srobertvoid PrintAbs(int x) { 359*12c85518Srobert int result; 360*12c85518Srobert if (x >= 0) { 361*12c85518Srobert // Flow condition: x >= 0. 362*12c85518Srobert result = x; // result is {x if x >= 0} 363*12c85518Srobert } else { 364*12c85518Srobert // Flow condition: x < 0. 365*12c85518Srobert result = -x; // result is {-x if x < 0} 366*12c85518Srobert } 367*12c85518Srobert print(result); // result is {x if x >= 0; -x if x < 0} 368*12c85518Srobert} 369*12c85518Srobert``` 370*12c85518Srobert 371*12c85518SrobertOf course, in a program with loops, symbolic expressions for flow conditions can 372*12c85518Srobertgrow unbounded. A practical static analysis system must control this growth to 373*12c85518Srobertkeep the symbolic representations manageable and ensure that the data flow 374*12c85518Srobertanalysis terminates. For example, it can use a constraint solver to prune 375*12c85518Srobertimpossible flow conditions, and/or it can abstract them, losing precision, after 376*12c85518Sroberttheir symbolic representations grow beyond some threshold. This is similar to 377*12c85518Sroberthow we had to limit the sizes of computed sets of possible values to 3 elements. 378*12c85518Srobert 379*12c85518Srobert### Symbolic pointers 380*12c85518Srobert 381*12c85518SrobertThis approach proves to be particularly useful for modeling pointer values, 382*12c85518Srobertsince we don't care about specific addresses but just want to give a unique 383*12c85518Srobertidentifier to a memory location. 384*12c85518Srobert 385*12c85518Srobert```c++ 386*12c85518Srobertvoid ExampleOfSymbolicPointers(bool b) { 387*12c85518Srobert int x = 0; // x is {0} 388*12c85518Srobert int* ptr = &x; // x is {0} ptr is {&x} 389*12c85518Srobert if (b) { 390*12c85518Srobert *ptr = 42; // x is {42} ptr is {&x} 391*12c85518Srobert } 392*12c85518Srobert print(x); // x is {0; 42} ptr is {&x} 393*12c85518Srobert} 394*12c85518Srobert``` 395*12c85518Srobert 396*12c85518Srobert## Example: finding output parameters 397*12c85518Srobert 398*12c85518SrobertLet's explore how data flow analysis can help with a problem that is hard to 399*12c85518Srobertsolve with other tools in Clang. 400*12c85518Srobert 401*12c85518Srobert### Problem description 402*12c85518Srobert 403*12c85518SrobertOutput parameters are function parameters of pointer or reference type whose 404*12c85518Srobertpointee is completely overwritten by the function, and not read before it is 405*12c85518Srobertoverwritten. They are common in pre-C++11 code due to the absence of move 406*12c85518Srobertsemantics. In modern C++ output parameters are non-idiomatic, and return values 407*12c85518Srobertare used instead. 408*12c85518Srobert 409*12c85518SrobertImagine that we would like to refactor output parameters to return values to 410*12c85518Srobertmodernize old code. The first step is to identify refactoring candidates through 411*12c85518Srobertstatic analysis. 412*12c85518Srobert 413*12c85518SrobertFor example, in the following code snippet the pointer `c` is an output 414*12c85518Srobertparameter: 415*12c85518Srobert 416*12c85518Srobert```c++ 417*12c85518Srobertstruct Customer { 418*12c85518Srobert int account_id; 419*12c85518Srobert std::string name; 420*12c85518Srobert} 421*12c85518Srobert 422*12c85518Srobertvoid GetCustomer(Customer *c) { 423*12c85518Srobert c->account_id = ...; 424*12c85518Srobert if (...) { 425*12c85518Srobert c->name = ...; 426*12c85518Srobert } else { 427*12c85518Srobert c->name = ...; 428*12c85518Srobert } 429*12c85518Srobert} 430*12c85518Srobert``` 431*12c85518Srobert 432*12c85518SrobertWe would like to refactor this code into: 433*12c85518Srobert 434*12c85518Srobert```c++ 435*12c85518SrobertCustomer GetCustomer() { 436*12c85518Srobert Customer c; 437*12c85518Srobert c.account_id = ...; 438*12c85518Srobert if (...) { 439*12c85518Srobert c.name = ...; 440*12c85518Srobert } else { 441*12c85518Srobert c.name = ...; 442*12c85518Srobert } 443*12c85518Srobert return c; 444*12c85518Srobert} 445*12c85518Srobert``` 446*12c85518Srobert 447*12c85518SrobertHowever, in the function below the parameter `c` is not an output parameter 448*12c85518Srobertbecause its field `name` is not overwritten on every path through the function. 449*12c85518Srobert 450*12c85518Srobert```c++ 451*12c85518Srobertvoid GetCustomer(Customer *c) { 452*12c85518Srobert c->account_id = ...; 453*12c85518Srobert if (...) { 454*12c85518Srobert c->name = ...; 455*12c85518Srobert } 456*12c85518Srobert} 457*12c85518Srobert``` 458*12c85518Srobert 459*12c85518SrobertThe code also cannot read the value of the parameter before overwriting it: 460*12c85518Srobert 461*12c85518Srobert```c++ 462*12c85518Srobertvoid GetCustomer(Customer *c) { 463*12c85518Srobert use(c->account_id); 464*12c85518Srobert c->name = ...; 465*12c85518Srobert c->account_id = ...; 466*12c85518Srobert} 467*12c85518Srobert``` 468*12c85518Srobert 469*12c85518SrobertFunctions that escape the pointer also block the refactoring: 470*12c85518Srobert 471*12c85518Srobert```c++ 472*12c85518SrobertCustomer* kGlobalCustomer; 473*12c85518Srobert 474*12c85518Srobertvoid GetCustomer(Customer *c) { 475*12c85518Srobert c->name = ...; 476*12c85518Srobert c->account_id = ...; 477*12c85518Srobert kGlobalCustomer = c; 478*12c85518Srobert} 479*12c85518Srobert``` 480*12c85518Srobert 481*12c85518SrobertTo identify a candidate function for refactoring, we need to do the following: 482*12c85518Srobert 483*12c85518Srobert* Find a function with a non-const pointer or reference parameter. 484*12c85518Srobert 485*12c85518Srobert* Find the definition of that function. 486*12c85518Srobert 487*12c85518Srobert* Prove that the function completely overwrites the pointee on all paths 488*12c85518Srobert before returning. 489*12c85518Srobert 490*12c85518Srobert* Prove that the function reads the pointee only after overwriting it. 491*12c85518Srobert 492*12c85518Srobert* Prove that the function does not persist the pointer in a data structure 493*12c85518Srobert that is live after the function returns. 494*12c85518Srobert 495*12c85518SrobertThere are also requirements that all usage sites of the candidate function must 496*12c85518Srobertsatisfy, for example, that function arguments do not alias, that users are not 497*12c85518Sroberttaking the address of the function, and so on. Let's consider verifying usage 498*12c85518Srobertsite conditions to be a separate static analysis problem. 499*12c85518Srobert 500*12c85518Srobert### Lattice design 501*12c85518Srobert 502*12c85518SrobertTo analyze the function body we can use a lattice which consists of normal 503*12c85518Srobertstates and failure states. A normal state describes program points where we are 504*12c85518Srobertsure that no behaviors that block the refactoring have occurred. Normal states 505*12c85518Srobertkeep track of all parameter's member fields that are known to be overwritten on 506*12c85518Srobertevery path from function entry to the corresponding program point. Failure 507*12c85518Srobertstates accumulate observed violations (unsafe reads and pointer escapes) that 508*12c85518Srobertblock the refactoring. 509*12c85518Srobert 510*12c85518SrobertIn the partial order of the lattice failure states compare greater than normal 511*12c85518Srobertstates, which guarantees that they "win" when joined with normal states. Order 512*12c85518Srobertbetween failure states is determined by inclusion relation on the set of 513*12c85518Srobertaccumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order 514*12c85518Srobertbetween normal states is determined by reversed inclusion relation on the set of 515*12c85518Srobertoverwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of 516*12c85518Srobertoverwritten fields). 517*12c85518Srobert 518*12c85518Srobert 519*12c85518Srobert 520*12c85518SrobertTo determine whether a statement reads or writes a field we can implement 521*12c85518Srobertsymbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer 522*12c85518Srobertdereference operator and `MemberExpr`s. 523*12c85518Srobert 524*12c85518Srobert### Using data flow results to identify output parameters 525*12c85518Srobert 526*12c85518SrobertLet's take a look at how we use data flow analysis to identify an output 527*12c85518Srobertparameter. The refactoring can be safely done when the data flow algorithm 528*12c85518Srobertcomputes a normal state with all of the fields proven to be overwritten in the 529*12c85518Srobertexit basic block of the function. 530*12c85518Srobert 531*12c85518Srobert```c++ 532*12c85518Srobertstruct Customer { 533*12c85518Srobert int account_id; 534*12c85518Srobert std::string name; 535*12c85518Srobert}; 536*12c85518Srobert 537*12c85518Srobertvoid GetCustomer(Customer* c) { 538*12c85518Srobert // Overwritten: {} 539*12c85518Srobert c->account_id = ...; // Overwritten: {c->account_id} 540*12c85518Srobert if (...) { 541*12c85518Srobert c->name = ...; // Overwritten: {c->account_id, c->name} 542*12c85518Srobert } else { 543*12c85518Srobert c->name = ...; // Overwritten: {c->account_id, c->name} 544*12c85518Srobert } 545*12c85518Srobert // Overwritten: {c->account_id, c->name} 546*12c85518Srobert} 547*12c85518Srobert``` 548*12c85518Srobert 549*12c85518SrobertWhen the data flow algorithm computes a normal state, but not all fields are 550*12c85518Srobertproven to be overwritten we can't perform the refactoring. 551*12c85518Srobert 552*12c85518Srobert```c++ 553*12c85518Srobertvoid target(bool b, Customer* c) { 554*12c85518Srobert // Overwritten: {} 555*12c85518Srobert if (b) { 556*12c85518Srobert c->account_id = 42; // Overwritten: {c->account_id} 557*12c85518Srobert } else { 558*12c85518Srobert c->name = "Konrad"; // Overwritten: {c->name} 559*12c85518Srobert } 560*12c85518Srobert // Overwritten: {} 561*12c85518Srobert} 562*12c85518Srobert``` 563*12c85518Srobert 564*12c85518SrobertSimilarly, when the data flow algorithm computes a failure state, we also can't 565*12c85518Srobertperform the refactoring. 566*12c85518Srobert 567*12c85518Srobert```c++ 568*12c85518SrobertCustomer* kGlobalCustomer; 569*12c85518Srobert 570*12c85518Srobertvoid GetCustomer(Customer* c) { 571*12c85518Srobert // Overwritten: {} 572*12c85518Srobert c->account_id = ...; // Overwritten: {c->account_id} 573*12c85518Srobert if (...) { 574*12c85518Srobert print(c->name); // Unsafe read 575*12c85518Srobert } else { 576*12c85518Srobert kGlobalCustomer = c; // Pointer escape 577*12c85518Srobert } 578*12c85518Srobert // Unsafe read, Pointer escape 579*12c85518Srobert} 580*12c85518Srobert``` 581*12c85518Srobert 582*12c85518Srobert## Example: finding dead stores 583*12c85518Srobert 584*12c85518SrobertLet's say we want to find redundant stores, because they indicate potential 585*12c85518Srobertbugs. 586*12c85518Srobert 587*12c85518Srobert```c++ 588*12c85518Srobertx = GetX(); 589*12c85518Srobertx = GetY(); 590*12c85518Srobert``` 591*12c85518Srobert 592*12c85518SrobertThe first store to `x` is never read, probably there is a bug. 593*12c85518Srobert 594*12c85518SrobertThe implementation of dead store analysis is very similar to output parameter 595*12c85518Srobertanalysis: we need to track stores and loads, and find stores that were never 596*12c85518Srobertread. 597*12c85518Srobert 598*12c85518Srobert[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a 599*12c85518Srobertgeneralization of this idea, which is often used to answer many related 600*12c85518Srobertquestions, for example: 601*12c85518Srobert 602*12c85518Srobert* finding dead stores, 603*12c85518Srobert* finding uninitialized variables, 604*12c85518Srobert* finding a good point to deallocate memory, 605*12c85518Srobert* finding out if it would be safe to move an object. 606*12c85518Srobert 607*12c85518Srobert## Example: definitive initialization 608*12c85518Srobert 609*12c85518SrobertDefinitive initialization proves that variables are known to be initialized when 610*12c85518Srobertread. If we find a variable which is read when not initialized then we generate 611*12c85518Sroberta warning. 612*12c85518Srobert 613*12c85518Srobert```c++ 614*12c85518Srobertvoid Init() { 615*12c85518Srobert int x; // x is uninitialized 616*12c85518Srobert if (cond()) { 617*12c85518Srobert x = 10; // x is initialized 618*12c85518Srobert } else { 619*12c85518Srobert x = 20; // x is initialized 620*12c85518Srobert } 621*12c85518Srobert print(x); // x is initialized 622*12c85518Srobert} 623*12c85518Srobert``` 624*12c85518Srobert 625*12c85518Srobert```c++ 626*12c85518Srobertvoid Uninit() { 627*12c85518Srobert int x; // x is uninitialized 628*12c85518Srobert if (cond()) { 629*12c85518Srobert x = 10; // x is initialized 630*12c85518Srobert } 631*12c85518Srobert print(x); // x is maybe uninitialized, x is being read, report a bug. 632*12c85518Srobert} 633*12c85518Srobert``` 634*12c85518Srobert 635*12c85518SrobertFor this purpose we can use lattice in a form of a mapping from variable 636*12c85518Srobertdeclarations to initialization states; each initialization state is represented 637*12c85518Srobertby the followingn lattice: 638*12c85518Srobert 639*12c85518Srobert 640*12c85518Srobert 641*12c85518SrobertA lattice element could also capture the source locations of the branches that 642*12c85518Srobertlead us to the corresponding program point. Diagnostics would use this 643*12c85518Srobertinformation to show a sample buggy code path to the user. 644*12c85518Srobert 645*12c85518Srobert## Example: refactoring raw pointers to `unique_ptr` 646*12c85518Srobert 647*12c85518SrobertModern idiomatic C++ uses smart pointers to express memory ownership, however in 648*12c85518Srobertpre-C++11 code one can often find raw pointers that own heap memory blocks. 649*12c85518Srobert 650*12c85518SrobertImagine that we would like to refactor raw pointers that own memory to 651*12c85518Srobert`unique_ptr`. There are multiple ways to design a data flow analysis for this 652*12c85518Srobertproblem; let's look at one way to do it. 653*12c85518Srobert 654*12c85518SrobertFor example, we would like to refactor the following code that uses raw 655*12c85518Srobertpointers: 656*12c85518Srobert 657*12c85518Srobert```c++ 658*12c85518Srobertvoid UniqueOwnership1() { 659*12c85518Srobert int *pi = new int; 660*12c85518Srobert if (...) { 661*12c85518Srobert Borrow(pi); 662*12c85518Srobert delete pi; 663*12c85518Srobert } else { 664*12c85518Srobert TakeOwnership(pi); 665*12c85518Srobert } 666*12c85518Srobert} 667*12c85518Srobert``` 668*12c85518Srobert 669*12c85518Srobertinto code that uses `unique_ptr`: 670*12c85518Srobert 671*12c85518Srobert```c++ 672*12c85518Srobertvoid UniqueOwnership1() { 673*12c85518Srobert auto pi = std::make_unique<int>(); 674*12c85518Srobert if (...) { 675*12c85518Srobert Borrow(pi.get()); 676*12c85518Srobert } else { 677*12c85518Srobert TakeOwnership(pi.release()); 678*12c85518Srobert } 679*12c85518Srobert} 680*12c85518Srobert``` 681*12c85518Srobert 682*12c85518SrobertThis problem can be solved with a lattice in form of map from value declarations 683*12c85518Srobertto pointer states: 684*12c85518Srobert 685*12c85518Srobert 686*12c85518Srobert 687*12c85518SrobertWe can perform the refactoring if at the exit of a function `pi` is 688*12c85518Srobert`Compatible`. 689*12c85518Srobert 690*12c85518Srobert```c++ 691*12c85518Srobertvoid UniqueOwnership1() { 692*12c85518Srobert int *pi; // pi is Compatible 693*12c85518Srobert pi = new int; // pi is Defined 694*12c85518Srobert if (...) { 695*12c85518Srobert Borrow(pi); // pi is Defined 696*12c85518Srobert delete pi; // pi is Compatible 697*12c85518Srobert } else { 698*12c85518Srobert TakeOwnership(pi); // pi is Compatible 699*12c85518Srobert } 700*12c85518Srobert // pi is Compatible 701*12c85518Srobert} 702*12c85518Srobert``` 703*12c85518Srobert 704*12c85518SrobertLet's look at an example where the raw pointer owns two different memory blocks: 705*12c85518Srobert 706*12c85518Srobert```c++ 707*12c85518Srobertvoid UniqueOwnership2() { 708*12c85518Srobert int *pi = new int; // pi is Defined 709*12c85518Srobert Borrow(pi); 710*12c85518Srobert delete pi; // pi is Compatible 711*12c85518Srobert if (smth) { 712*12c85518Srobert pi = new int; // pi is Defined 713*12c85518Srobert Borrow(pi); 714*12c85518Srobert delete pi; // pi is Compatible 715*12c85518Srobert } 716*12c85518Srobert // pi is Compatible 717*12c85518Srobert} 718*12c85518Srobert``` 719*12c85518Srobert 720*12c85518SrobertIt can be refactored to use `unique_ptr` like this: 721*12c85518Srobert 722*12c85518Srobert```c++ 723*12c85518Srobertvoid UniqueOwnership2() { 724*12c85518Srobert auto pi = make_unique<int>(); 725*12c85518Srobert Borrow(pi); 726*12c85518Srobert if (smth) { 727*12c85518Srobert pi = make_unique<int>(); 728*12c85518Srobert Borrow(pi); 729*12c85518Srobert } 730*12c85518Srobert} 731*12c85518Srobert``` 732*12c85518Srobert 733*12c85518SrobertIn the following example, the raw pointer is used to access the heap object 734*12c85518Srobertafter the ownership has been transferred. 735*12c85518Srobert 736*12c85518Srobert```c++ 737*12c85518Srobertvoid UniqueOwnership3() { 738*12c85518Srobert int *pi = new int; // pi is Defined 739*12c85518Srobert if (...) { 740*12c85518Srobert Borrow(pi); 741*12c85518Srobert delete pi; // pi is Compatible 742*12c85518Srobert } else { 743*12c85518Srobert vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible 744*12c85518Srobert print(*pi); 745*12c85518Srobert use(v); 746*12c85518Srobert } 747*12c85518Srobert // pi is Compatible 748*12c85518Srobert} 749*12c85518Srobert``` 750*12c85518Srobert 751*12c85518SrobertWe can refactor this code to use `unique_ptr`, however we would have to 752*12c85518Srobertintroduce a non-owning pointer variable, since we can't use the moved-from 753*12c85518Srobert`unique_ptr` to access the object: 754*12c85518Srobert 755*12c85518Srobert```c++ 756*12c85518Srobertvoid UniqueOwnership3() { 757*12c85518Srobert std::unique_ptr<int> pi = std::make_unique<int>(); 758*12c85518Srobert if (...) { 759*12c85518Srobert Borrow(pi); 760*12c85518Srobert } else { 761*12c85518Srobert int *pi_non_owning = pi.get(); 762*12c85518Srobert vector<unique_ptr<int>> v = {std::move(pi)}; 763*12c85518Srobert print(*pi_non_owning); 764*12c85518Srobert use(v); 765*12c85518Srobert } 766*12c85518Srobert} 767*12c85518Srobert``` 768*12c85518Srobert 769*12c85518SrobertIf the original code didn't call `delete` at the very end of the function, then 770*12c85518Srobertour refactoring may change the point at which we run the destructor and release 771*12c85518Srobertmemory. Specifically, if there is some user code after `delete`, then extending 772*12c85518Srobertthe lifetime of the object until the end of the function may hold locks for 773*12c85518Srobertlonger than necessary, introduce memory overhead etc. 774*12c85518Srobert 775*12c85518SrobertOne solution is to always replace `delete` with a call to `reset()`, and then 776*12c85518Srobertperform another analysis that removes unnecessary `reset()` calls. 777*12c85518Srobert 778*12c85518Srobert```c++ 779*12c85518Srobertvoid AddedMemoryOverhead() { 780*12c85518Srobert HugeObject *ho = new HugeObject(); 781*12c85518Srobert use(ho); 782*12c85518Srobert delete ho; // Release the large amount of memory quickly. 783*12c85518Srobert LongRunningFunction(); 784*12c85518Srobert} 785*12c85518Srobert``` 786*12c85518Srobert 787*12c85518SrobertThis analysis will refuse to refactor code that mixes borrowed pointer values 788*12c85518Srobertand unique ownership. In the following code, `GetPtr()` returns a borrowed 789*12c85518Srobertpointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned 790*12c85518Srobertpointer. We don't distinguish between these two assignments, and we want each 791*12c85518Srobertassignment to be paired with a corresponding sink; otherwise, we transition the 792*12c85518Srobertpointer to a `Conflicting` state, like in this example. 793*12c85518Srobert 794*12c85518Srobert```c++ 795*12c85518Srobertvoid ConflictingOwnership() { 796*12c85518Srobert int *pi; // pi is Compatible 797*12c85518Srobert pi = GetPtr(); // pi is Defined 798*12c85518Srobert Borrow(pi); // pi is Defined 799*12c85518Srobert 800*12c85518Srobert pi = new int; // pi is Conflicting 801*12c85518Srobert Borrow(pi); 802*12c85518Srobert delete pi; 803*12c85518Srobert // pi is Conflicting 804*12c85518Srobert} 805*12c85518Srobert``` 806*12c85518Srobert 807*12c85518SrobertWe could still handle this case by finding a maximal range in the code where 808*12c85518Srobert`pi` could be in the Compatible state, and only refactoring that part. 809*12c85518Srobert 810*12c85518Srobert```c++ 811*12c85518Srobertvoid ConflictingOwnership() { 812*12c85518Srobert int *pi; 813*12c85518Srobert pi = GetPtr(); 814*12c85518Srobert Borrow(pi); 815*12c85518Srobert 816*12c85518Srobert std::unique_ptr<int> pi_unique = std::make_unique<int>(); 817*12c85518Srobert Borrow(pi_unique.get()); 818*12c85518Srobert} 819*12c85518Srobert``` 820*12c85518Srobert 821*12c85518Srobert## Example: finding redundant branch conditions 822*12c85518Srobert 823*12c85518SrobertIn the code below `b1` should not be checked in both the outer and inner "if" 824*12c85518Srobertstatements. It is likely there is a bug in this code. 825*12c85518Srobert 826*12c85518Srobert```c++ 827*12c85518Srobertint F(bool b1, bool b2) { 828*12c85518Srobert if (b1) { 829*12c85518Srobert f(); 830*12c85518Srobert if (b1 && b2) { // Check `b1` again -- unnecessary! 831*12c85518Srobert g(); 832*12c85518Srobert } 833*12c85518Srobert } 834*12c85518Srobert} 835*12c85518Srobert``` 836*12c85518Srobert 837*12c85518SrobertA checker that finds this pattern syntactically is already implemented in 838*12c85518SrobertClangTidy using AST matchers (`bugprone-redundant-branch-condition`). 839*12c85518Srobert 840*12c85518SrobertTo implement it using the data flow analysis framework, we can produce a warning 841*12c85518Srobertif any part of the branch condition is implied by the flow condition. 842*12c85518Srobert 843*12c85518Srobert```c++ 844*12c85518Srobertint F(bool b1, bool b2) { 845*12c85518Srobert // Flow condition: true. 846*12c85518Srobert if (b1) { 847*12c85518Srobert // Flow condition: b1. 848*12c85518Srobert f(); 849*12c85518Srobert if (b1 && b2) { // `b1` is implied by the flow condition. 850*12c85518Srobert g(); 851*12c85518Srobert } 852*12c85518Srobert } 853*12c85518Srobert} 854*12c85518Srobert``` 855*12c85518Srobert 856*12c85518SrobertOne way to check this implication is to use a SAT solver. Without a SAT solver, 857*12c85518Srobertwe could keep the flow condition in the CNF form and then it would be easy to 858*12c85518Srobertcheck the implication. 859*12c85518Srobert 860*12c85518Srobert## Example: finding unchecked `std::optional` unwraps 861*12c85518Srobert 862*12c85518SrobertCalling `optional::value()` is only valid if `optional::has_value()` is true. We 863*12c85518Srobertwant to show that when `x.value()` is executed, the flow condition implies 864*12c85518Srobert`x.has_value()`. 865*12c85518Srobert 866*12c85518SrobertIn the example below `x.value()` is accessed safely because it is guarded by the 867*12c85518Srobert`x.has_value()` check. 868*12c85518Srobert 869*12c85518Srobert```c++ 870*12c85518Srobertvoid Example(std::optional<int> &x) { 871*12c85518Srobert if (x.has_value()) { 872*12c85518Srobert use(x.value()); 873*12c85518Srobert } 874*12c85518Srobert} 875*12c85518Srobert``` 876*12c85518Srobert 877*12c85518SrobertWhile entering the if branch we deduce that `x.has_value()` is implied by the 878*12c85518Srobertflow condition. 879*12c85518Srobert 880*12c85518Srobert```c++ 881*12c85518Srobertvoid Example(std::optional<int> x) { 882*12c85518Srobert // Flow condition: true. 883*12c85518Srobert if (x.has_value()) { 884*12c85518Srobert // Flow condition: x.has_value() == true. 885*12c85518Srobert use(x.value()); 886*12c85518Srobert } 887*12c85518Srobert // Flow condition: true. 888*12c85518Srobert} 889*12c85518Srobert``` 890*12c85518Srobert 891*12c85518SrobertWe also need to prove that `x` is not modified between check and value access. 892*12c85518SrobertThe modification of `x` may be very subtle: 893*12c85518Srobert 894*12c85518Srobert```c++ 895*12c85518Srobertvoid F(std::optional<int> &x); 896*12c85518Srobert 897*12c85518Srobertvoid Example(std::optional<int> &x) { 898*12c85518Srobert if (x.has_value()) { 899*12c85518Srobert // Flow condition: x.has_value() == true. 900*12c85518Srobert unknown_function(x); // may change x. 901*12c85518Srobert // Flow condition: true. 902*12c85518Srobert use(x.value()); 903*12c85518Srobert } 904*12c85518Srobert} 905*12c85518Srobert``` 906*12c85518Srobert 907*12c85518Srobert## Example: finding dead code behind A/B experiment flags 908*12c85518Srobert 909*12c85518SrobertFinding dead code is a classic application of data flow analysis. 910*12c85518Srobert 911*12c85518SrobertUnused flags for A/B experiment hide dead code. However, this flavor of dead 912*12c85518Srobertcode is invisible to the compiler because the flag can be turned on at any 913*12c85518Srobertmoment. 914*12c85518Srobert 915*12c85518SrobertWe could make a tool that deletes experiment flags. The user tells us which flag 916*12c85518Srobertthey want to delete, and we assume that the it's value is a given constant. 917*12c85518Srobert 918*12c85518SrobertFor example, the user could use the tool to remove `example_flag` from this 919*12c85518Srobertcode: 920*12c85518Srobert 921*12c85518Srobert```c++ 922*12c85518SrobertDEFINE_FLAG(std::string, example_flag, "", "A sample flag."); 923*12c85518Srobert 924*12c85518Srobertvoid Example() { 925*12c85518Srobert bool x = GetFlag(FLAGS_example_flag).empty(); 926*12c85518Srobert f(); 927*12c85518Srobert if (x) { 928*12c85518Srobert g(); 929*12c85518Srobert } else { 930*12c85518Srobert h(); 931*12c85518Srobert } 932*12c85518Srobert} 933*12c85518Srobert``` 934*12c85518Srobert 935*12c85518SrobertThe tool would simplify the code to: 936*12c85518Srobert 937*12c85518Srobert```c++ 938*12c85518Srobertvoid Example() { 939*12c85518Srobert f(); 940*12c85518Srobert g(); 941*12c85518Srobert} 942*12c85518Srobert``` 943*12c85518Srobert 944*12c85518SrobertWe can solve this problem with a classic constant propagation lattice combined 945*12c85518Srobertwith symbolic evaluation. 946*12c85518Srobert 947*12c85518Srobert## Example: finding inefficient usages of associative containers 948*12c85518Srobert 949*12c85518SrobertReal-world code often accidentally performs repeated lookups in associative 950*12c85518Srobertcontainers: 951*12c85518Srobert 952*12c85518Srobert```c++ 953*12c85518Srobertmap<int, Employee> xs; 954*12c85518Srobertxs[42]->name = "..."; 955*12c85518Srobertxs[42]->title = "..."; 956*12c85518Srobert``` 957*12c85518Srobert 958*12c85518SrobertTo find the above inefficiency we can use the available expressions analysis to 959*12c85518Srobertunderstand that `m[42]` is evaluated twice. 960*12c85518Srobert 961*12c85518Srobert```c++ 962*12c85518Srobertmap<int, Employee> xs; 963*12c85518SrobertEmployee &e = xs[42]; 964*12c85518Sroberte->name = "..."; 965*12c85518Sroberte->title = "..."; 966*12c85518Srobert``` 967*12c85518Srobert 968*12c85518SrobertWe can also track the `m.contains()` check in the flow condition to find 969*12c85518Srobertredundant checks, like in the example below. 970*12c85518Srobert 971*12c85518Srobert```c++ 972*12c85518Srobertstd::map<int, Employee> xs; 973*12c85518Srobertif (!xs.contains(42)) { 974*12c85518Srobert xs.insert({42, someEmployee}); 975*12c85518Srobert} 976*12c85518Srobert``` 977*12c85518Srobert 978*12c85518Srobert## Example: refactoring types that implicitly convert to each other 979*12c85518Srobert 980*12c85518SrobertRefactoring one strong type to another is difficult, but the compiler can help: 981*12c85518Srobertonce you refactor one reference to the type, the compiler will flag other places 982*12c85518Srobertwhere this information flows with type mismatch errors. Unfortunately this 983*12c85518Srobertstrategy does not work when you are refactoring types that implicitly convert to 984*12c85518Sroberteach other, for example, replacing `int32_t` with `int64_t`. 985*12c85518Srobert 986*12c85518SrobertImagine that we want to change user IDs from 32 to 64-bit integers. In other 987*12c85518Srobertwords, we need to find all integers tainted with user IDs. We can use data flow 988*12c85518Srobertanalysis to implement taint analysis. 989*12c85518Srobert 990*12c85518Srobert```c++ 991*12c85518Srobertvoid UseUser(int32_t user_id) { 992*12c85518Srobert int32_t id = user_id; 993*12c85518Srobert // Variable `id` is tainted with a user ID. 994*12c85518Srobert ... 995*12c85518Srobert} 996*12c85518Srobert``` 997*12c85518Srobert 998*12c85518SrobertTaint analysis is very well suited to this problem because the program rarely 999*12c85518Srobertbranches on user IDs, and almost certainly does not perform any computation 1000*12c85518Srobert(like arithmetic). 1001