1ab31d003SDmitri Gribenko# Data flow analysis: an informal introduction 2ab31d003SDmitri Gribenko 3ab31d003SDmitri Gribenko## Abstract 4ab31d003SDmitri Gribenko 5ab31d003SDmitri GribenkoThis document introduces data flow analysis in an informal way. The goal is to 6ab31d003SDmitri Gribenkogive the reader an intuitive understanding of how it works, and show how it 7ab31d003SDmitri Gribenkoapplies to a range of refactoring and bug finding problems. 8ab31d003SDmitri Gribenko 9ab31d003SDmitri GribenkoData flow analysis is a well-established technique; it is described in many 10ab31d003SDmitri Gribenkopapers, books, and videos. If you would like a more formal, or a more thorough 11ab31d003SDmitri Gribenkoexplanation of the concepts mentioned in this document, please refer to the 12ab31d003SDmitri Gribenkofollowing resources: 13ab31d003SDmitri Gribenko 14ab31d003SDmitri Gribenko* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)). 15ab31d003SDmitri Gribenko* Videos on the PacketPrep YouTube channel that introduce lattices and the 16ab31d003SDmitri Gribenko necessary background information: 17ab31d003SDmitri Gribenko [#20](https://www.youtube.com/watch?v=73j_FXBXGm8), 18ab31d003SDmitri Gribenko [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8), 19ab31d003SDmitri Gribenko [#22](https://www.youtube.com/watch?v=saOG7Uooeho), 20ab31d003SDmitri Gribenko [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g), 21ab31d003SDmitri Gribenko [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc), 22ab31d003SDmitri Gribenko [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw). 23ab31d003SDmitri Gribenko* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE) 24ab31d003SDmitri Gribenko* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf). 25ab31d003SDmitri Gribenko* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf). 26ab31d003SDmitri Gribenko* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/). 27ab31d003SDmitri Gribenko* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf) 28ab31d003SDmitri Gribenko (a paper that successfully applies symbolic execution to real-world 29ab31d003SDmitri Gribenko software). 30ab31d003SDmitri Gribenko 31ab31d003SDmitri Gribenko## Data flow analysis 32ab31d003SDmitri Gribenko 33ab31d003SDmitri Gribenko### The purpose of data flow analysis 34ab31d003SDmitri Gribenko 35ab31d003SDmitri GribenkoData flow analysis is a static analysis technique that proves facts about a 36ab31d003SDmitri Gribenkoprogram or its fragment. It can make conclusions about all paths through the 37ab31d003SDmitri Gribenkoprogram, while taking control flow into account and scaling to large programs. 38ab31d003SDmitri GribenkoThe basic idea is propagating facts about the program through the edges of the 39ab31d003SDmitri Gribenkocontrol flow graph (CFG) until a fixpoint is reached. 40ab31d003SDmitri Gribenko 41ab31d003SDmitri Gribenko### Sample problem and an ad-hoc solution 42ab31d003SDmitri Gribenko 43ab31d003SDmitri GribenkoWe would like to explain data flow analysis while discussing an example. Let's 44ab31d003SDmitri Gribenkoimagine that we want to track possible values of an integer variable in our 45ab31d003SDmitri Gribenkoprogram. Here is how a human could annotate the code: 46ab31d003SDmitri Gribenko 47ab31d003SDmitri Gribenko```c++ 48ab31d003SDmitri Gribenkovoid Example(int n) { 49ab31d003SDmitri Gribenko int x = 0; 50ab31d003SDmitri Gribenko // x is {0} 51ab31d003SDmitri Gribenko if (n > 0) { 52ab31d003SDmitri Gribenko x = 5; 53ab31d003SDmitri Gribenko // x is {5} 54ab31d003SDmitri Gribenko } else { 55ab31d003SDmitri Gribenko x = 42; 56ab31d003SDmitri Gribenko // x is {42} 57ab31d003SDmitri Gribenko } 58ab31d003SDmitri Gribenko // x is {5; 42} 59ab31d003SDmitri Gribenko print(x); 60ab31d003SDmitri Gribenko} 61ab31d003SDmitri Gribenko``` 62ab31d003SDmitri Gribenko 63ab31d003SDmitri GribenkoWe use sets of integers to represent possible values of `x`. Local variables 64ab31d003SDmitri Gribenkohave unambiguous values between statements, so we annotate program points 65ab31d003SDmitri Gribenkobetween statements with sets of possible values. 66ab31d003SDmitri Gribenko 67ab31d003SDmitri GribenkoHere is how we arrived at these annotations. Assigning a constant to `x` allows 68ab31d003SDmitri Gribenkous to make a conclusion that `x` can only have one value. When control flow from 69ab31d003SDmitri Gribenkothe "then" and "else" branches joins, `x` can have either value. 70ab31d003SDmitri Gribenko 71ab31d003SDmitri GribenkoAbstract algebra provides a nice formalism that models this kind of structure, 72ab31d003SDmitri Gribenkonamely, a lattice. A join-semilattice is a partially ordered set, in which every 73ab31d003SDmitri Gribenkotwo elements have a least upper bound (called a *join*). 74ab31d003SDmitri Gribenko 75ab31d003SDmitri Gribenko``` 76ab31d003SDmitri Gribenkojoin(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x 77ab31d003SDmitri Gribenko``` 78ab31d003SDmitri Gribenko 79ab31d003SDmitri GribenkoFor this problem we will use the lattice of subsets of integers, with set 80ab31d003SDmitri Gribenkoinclusion relation as ordering and set union as a join. 81ab31d003SDmitri Gribenko 82ab31d003SDmitri GribenkoLattices are often represented visually as Hasse diagrams. Here is a Hasse 83ab31d003SDmitri Gribenkodiagram for our lattice that tracks subsets of integers: 84ab31d003SDmitri Gribenko 85ab31d003SDmitri Gribenko 86ab31d003SDmitri Gribenko 87ab31d003SDmitri GribenkoComputing the join in the lattice corresponds to finding the lowest common 88ab31d003SDmitri Gribenkoancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of 89ab31d003SDmitri Gribenkoliterature on efficiently implementing LCA queries for a DAG, however Efficient 90ab31d003SDmitri GribenkoImplementation of Lattice Operations (1989) 91ab31d003SDmitri Gribenko([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911), 92ab31d003SDmitri Gribenko[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that 93ab31d003SDmitri Gribenkoparticularly well-suited for programmatic implementation. 94ab31d003SDmitri Gribenko 95ab31d003SDmitri Gribenko### Too much information and "top" values 96ab31d003SDmitri Gribenko 97ab31d003SDmitri GribenkoLet's try to find the possible sets of values of `x` in a function that modifies 98ab31d003SDmitri Gribenko`x` in a loop: 99ab31d003SDmitri Gribenko 100ab31d003SDmitri Gribenko```c++ 101ab31d003SDmitri Gribenkovoid ExampleOfInfiniteSets() { 102ab31d003SDmitri Gribenko int x = 0; // x is {0} 103ab31d003SDmitri Gribenko while (condition()) { 104ab31d003SDmitri Gribenko x += 1; // x is {0; 1; 2; …} 105ab31d003SDmitri Gribenko } 106ab31d003SDmitri Gribenko print(x); // x is {0; 1; 2; …} 107ab31d003SDmitri Gribenko} 108ab31d003SDmitri Gribenko``` 109ab31d003SDmitri Gribenko 110ab31d003SDmitri GribenkoWe have an issue: `x` can have any value greater than zero; that's an infinite 111ab31d003SDmitri Gribenkoset of values, if the program operated on mathematical integers. In C++ `int` is 112ab31d003SDmitri Gribenkolimited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is 113ab31d003SDmitri Gribenkostill really big. 114ab31d003SDmitri Gribenko 115ab31d003SDmitri GribenkoTo make our analysis practical to compute, we have to limit the amount of 116ab31d003SDmitri Gribenkoinformation that we track. In this case, we can, for example, arbitrarily limit 117ab31d003SDmitri Gribenkothe size of sets to 3 elements. If at a certain program point `x` has more than 118ab31d003SDmitri Gribenko3 possible values, we stop tracking specific values at that program point. 119ab31d003SDmitri GribenkoInstead, we denote possible values of `x` with the symbol `⊤` (pronounced "top" 120ab31d003SDmitri Gribenkoaccording to a convention in abstract algebra). 121ab31d003SDmitri Gribenko 122ab31d003SDmitri Gribenko```c++ 123ab31d003SDmitri Gribenkovoid ExampleOfTopWithALoop() { 124ab31d003SDmitri Gribenko int x = 0; // x is {0} 125ab31d003SDmitri Gribenko while (condition()) { 126ab31d003SDmitri Gribenko x += 1; // x is ⊤ 127ab31d003SDmitri Gribenko } 128ab31d003SDmitri Gribenko print(x); // x is ⊤ 129ab31d003SDmitri Gribenko} 130ab31d003SDmitri Gribenko``` 131ab31d003SDmitri Gribenko 132ab31d003SDmitri GribenkoThe statement "at this program point, `x`'s possible values are `⊤`" is 133ab31d003SDmitri Gribenkounderstood as "at this program point `x` can have any value because we have too 134ab31d003SDmitri Gribenkomuch information, or the information is conflicting". 135ab31d003SDmitri Gribenko 136ab31d003SDmitri GribenkoNote that we can get more than 3 possible values even without a loop: 137ab31d003SDmitri Gribenko 138ab31d003SDmitri Gribenko```c++ 139ab31d003SDmitri Gribenkovoid ExampleOfTopWithoutLoops(int n) { 140ab31d003SDmitri Gribenko int x = 0; // x is {0} 141ab31d003SDmitri Gribenko switch(n) { 142ab31d003SDmitri Gribenko case 0: x = 1; break; // x is {1} 143ab31d003SDmitri Gribenko case 1: x = 9; break; // x is {9} 144ab31d003SDmitri Gribenko case 2: x = 7; break; // x is {7} 145ab31d003SDmitri Gribenko default: x = 3; break; // x is {3} 146ab31d003SDmitri Gribenko } 147ab31d003SDmitri Gribenko // x is ⊤ 148ab31d003SDmitri Gribenko} 149ab31d003SDmitri Gribenko``` 150ab31d003SDmitri Gribenko 151ab31d003SDmitri Gribenko### Uninitialized variables and "bottom" values 152ab31d003SDmitri Gribenko 153ab31d003SDmitri GribenkoWhen `x` is declared but not initialized, it has no possible values. We 154ab31d003SDmitri Gribenkorepresent this fact symbolically as `⊥` (pronounced "bottom"). 155ab31d003SDmitri Gribenko 156ab31d003SDmitri Gribenko```c++ 157ab31d003SDmitri Gribenkovoid ExampleOfBottom() { 158ab31d003SDmitri Gribenko int x; // x is ⊥ 159ab31d003SDmitri Gribenko x = 42; // x is {42} 160ab31d003SDmitri Gribenko print(x); 161ab31d003SDmitri Gribenko} 162ab31d003SDmitri Gribenko``` 163ab31d003SDmitri Gribenko 164ab31d003SDmitri GribenkoNote that using values read from uninitialized variables is undefined behaviour 165ab31d003SDmitri Gribenkoin C++. Generally, compilers and static analysis tools can assume undefined 166ab31d003SDmitri Gribenkobehavior does not happen. We must model uninitialized variables only when we are 167ab31d003SDmitri Gribenkoimplementing a checker that specifically is trying to find uninitialized reads. 168ab31d003SDmitri GribenkoIn this example we show how to model uninitialized variables only to demonstrate 169ab31d003SDmitri Gribenkothe concept of "bottom", and how it applies to possible value analysis. We 170ab31d003SDmitri Gribenkodescribe an analysis that finds uninitialized reads in a section below. 171ab31d003SDmitri Gribenko 172ab31d003SDmitri Gribenko### A practical lattice that tracks sets of concrete values 173ab31d003SDmitri Gribenko 174ab31d003SDmitri GribenkoTaking into account all corner cases covered above, we can put together a 175ab31d003SDmitri Gribenkolattice that we can use in practice to track possible values of integer 176ab31d003SDmitri Gribenkovariables. This lattice represents sets of integers with 1, 2, or 3 elements, as 177ab31d003SDmitri Gribenkowell as top and bottom. Here is a Hasse diagram for it: 178ab31d003SDmitri Gribenko 179ab31d003SDmitri Gribenko 180ab31d003SDmitri Gribenko 181ab31d003SDmitri Gribenko### Formalization 182ab31d003SDmitri Gribenko 183ab31d003SDmitri GribenkoLet's consider a slightly more complex example, and think about how we can 184ab31d003SDmitri Gribenkocompute the sets of possible values algorithmically. 185ab31d003SDmitri Gribenko 186ab31d003SDmitri Gribenko```c++ 187ab31d003SDmitri Gribenkovoid Example(int n) { 188ab31d003SDmitri Gribenko int x; // x is ⊥ 189ab31d003SDmitri Gribenko if (n > 0) { 190ab31d003SDmitri Gribenko if (n == 42) { 191ab31d003SDmitri Gribenko x = 44; // x is {44} 192ab31d003SDmitri Gribenko } else { 193ab31d003SDmitri Gribenko x = 5; // x is {5} 194ab31d003SDmitri Gribenko } 195ab31d003SDmitri Gribenko print(x); // x is {44; 5} 196ab31d003SDmitri Gribenko } else { 197ab31d003SDmitri Gribenko x = n; // x is ⊤ 198ab31d003SDmitri Gribenko } 199ab31d003SDmitri Gribenko print(x); // x is ⊤ 200ab31d003SDmitri Gribenko} 201ab31d003SDmitri Gribenko``` 202ab31d003SDmitri Gribenko 203ab31d003SDmitri GribenkoAs humans, we understand the control flow from the program text. We used our 204ab31d003SDmitri Gribenkounderstanding of control flow to find program points where two flows join. 205ab31d003SDmitri GribenkoFormally, control flow is represented by a CFG (control flow graph): 206ab31d003SDmitri Gribenko 207ab31d003SDmitri Gribenko 208ab31d003SDmitri Gribenko 209ab31d003SDmitri GribenkoWe can compute sets of possible values by propagating them through the CFG of 210ab31d003SDmitri Gribenkothe function: 211ab31d003SDmitri Gribenko 212ab31d003SDmitri Gribenko* When `x` is declared but not initialized, its possible values are `{}`. The 213ab31d003SDmitri Gribenko empty set plays the role of `⊥` in this lattice. 214ab31d003SDmitri Gribenko 215ab31d003SDmitri Gribenko* When `x` is assigned a concrete value, its possible set of values contains 216ab31d003SDmitri Gribenko just that specific value. 217ab31d003SDmitri Gribenko 218ab31d003SDmitri Gribenko* When `x` is assigned some unknown value, it can have any value. We represent 219ab31d003SDmitri Gribenko this fact as `⊤`. 220ab31d003SDmitri Gribenko 221ab31d003SDmitri Gribenko* When two control flow paths join, we compute the set union of incoming 222*cc8237d9SAyushi Shukla values (limiting the number of elements to 3, representing larger sets as 223ab31d003SDmitri Gribenko `⊤`). 224ab31d003SDmitri Gribenko 225ab31d003SDmitri GribenkoThe sets of possible values are influenced by: 226ab31d003SDmitri Gribenko 227ab31d003SDmitri Gribenko* Statements, for example, assignments. 228ab31d003SDmitri Gribenko 229ab31d003SDmitri Gribenko* Joins in control flow, for example, ones that appear at the end of "if" 230ab31d003SDmitri Gribenko statements. 231ab31d003SDmitri Gribenko 232ab31d003SDmitri Gribenko**Effects of statements** are modeled by what is formally known as a transfer 233ab31d003SDmitri Gribenkofunction. A transfer function takes two arguments: the statement, and the state 234ab31d003SDmitri Gribenkoof `x` at the previous program point. It produces the state of `x` at the next 235ab31d003SDmitri Gribenkoprogram point. For example, the transfer function for assignment ignores the 236ab31d003SDmitri Gribenkostate at the previous program point: 237ab31d003SDmitri Gribenko 238ab31d003SDmitri Gribenko```c++ 239ab31d003SDmitri Gribenko// GIVEN: x is {42; 44} 240ab31d003SDmitri Gribenkox = 0; 241ab31d003SDmitri Gribenko// CONCLUSION: x is {0} 242ab31d003SDmitri Gribenko``` 243ab31d003SDmitri Gribenko 244ab31d003SDmitri GribenkoThe transfer function for `+` performs arithmetic on every set member: 245ab31d003SDmitri Gribenko 246ab31d003SDmitri Gribenko```c++ 247ab31d003SDmitri Gribenko// GIVEN: x is {42, 44} 248ab31d003SDmitri Gribenkox = x + 100; 249ab31d003SDmitri Gribenko// CONCLUSION: x is {142, 144} 250ab31d003SDmitri Gribenko``` 251ab31d003SDmitri Gribenko 252ab31d003SDmitri Gribenko**Effects of control flow** are modeled by joining the knowledge from all 253ab31d003SDmitri Gribenkopossible previous program points. 254ab31d003SDmitri Gribenko 255ab31d003SDmitri Gribenko```c++ 256ab31d003SDmitri Gribenkoif (...) { 257ab31d003SDmitri Gribenko ... 258ab31d003SDmitri Gribenko // GIVEN: x is {42} 259ab31d003SDmitri Gribenko} else { 260ab31d003SDmitri Gribenko ... 261ab31d003SDmitri Gribenko // GIVEN: x is {44} 262ab31d003SDmitri Gribenko} 263ab31d003SDmitri Gribenko// CONCLUSION: x is {42; 44} 264ab31d003SDmitri Gribenko``` 265ab31d003SDmitri Gribenko 266ab31d003SDmitri Gribenko```c++ 267ab31d003SDmitri Gribenko// GIVEN: x is {42} 268ab31d003SDmitri Gribenkowhile (...) { 269ab31d003SDmitri Gribenko ... 270ab31d003SDmitri Gribenko // GIVEN: x is {44} 271ab31d003SDmitri Gribenko} 272ab31d003SDmitri Gribenko// CONCLUSION: {42; 44} 273ab31d003SDmitri Gribenko``` 274ab31d003SDmitri Gribenko 275ab31d003SDmitri GribenkoThe predicate that we marked "given" is usually called a precondition, and the 276ab31d003SDmitri Gribenkoconclusion is called a postcondition. 277ab31d003SDmitri Gribenko 278ab31d003SDmitri GribenkoIn terms of the CFG, we join the information from all predecessor basic blocks. 279ab31d003SDmitri Gribenko 280ab31d003SDmitri Gribenko 281ab31d003SDmitri Gribenko 282ab31d003SDmitri GribenkoPutting it all together, to model the effects of a basic block we compute: 283ab31d003SDmitri Gribenko 284ab31d003SDmitri Gribenko``` 285ab31d003SDmitri Gribenkoout = transfer(basic_block, join(in_1, in_2, ..., in_n)) 286ab31d003SDmitri Gribenko``` 287ab31d003SDmitri Gribenko 288ab31d003SDmitri Gribenko(Note that there are other ways to write this equation that produce higher 289ab31d003SDmitri Gribenkoprecision analysis results. The trick is to keep exploring the execution paths 29090a6e354SphyBracketsseparately and delay joining until later. However, we won't discuss those 291ab31d003SDmitri Gribenkovariations here.) 292ab31d003SDmitri Gribenko 293ab31d003SDmitri GribenkoTo make a conclusion about all paths through the program, we repeat this 294ab31d003SDmitri Gribenkocomputation on all basic blocks until we reach a fixpoint. In other words, we 295ab31d003SDmitri Gribenkokeep propagating information through the CFG until the computed sets of values 296ab31d003SDmitri Gribenkostop changing. 297ab31d003SDmitri Gribenko 298ab31d003SDmitri GribenkoIf the lattice has a finite height and transfer functions are monotonic the 299ab31d003SDmitri Gribenkoalgorithm is guaranteed to terminate. Each iteration of the algorithm can 300ab31d003SDmitri Gribenkochange computed values only to larger values from the lattice. In the worst 301ab31d003SDmitri Gribenkocase, all computed values become `⊤`, which is not very useful, but at least the 302ab31d003SDmitri Gribenkoanalysis terminates at that point, because it can't change any of the values. 303ab31d003SDmitri Gribenko 304ab31d003SDmitri GribenkoFixpoint iteration can be optimised by only reprocessing basic blocks which had 305ab31d003SDmitri Gribenkoone of their inputs changed on the previous iteration. This is typically 306ab31d003SDmitri Gribenkoimplemented using a worklist queue. With this optimisation the time complexity 307ab31d003SDmitri Gribenkobecomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and 308ab31d003SDmitri Gribenko`|L|` is the size of lattice used by the analysis. 309ab31d003SDmitri Gribenko 310ab31d003SDmitri Gribenko## Symbolic execution: a very short informal introduction 311ab31d003SDmitri Gribenko 312ab31d003SDmitri Gribenko### Symbolic values 313ab31d003SDmitri Gribenko 314ab31d003SDmitri GribenkoIn the previous example where we tried to figure out what values a variable can 315ab31d003SDmitri Gribenkohave, the analysis had to be seeded with a concrete value. What if there are no 316ab31d003SDmitri Gribenkoassignments of concrete values in the program? We can still deduce some 317ab31d003SDmitri Gribenkointeresting information by representing unknown input values symbolically, and 318ab31d003SDmitri Gribenkocomputing results as symbolic expressions: 319ab31d003SDmitri Gribenko 320ab31d003SDmitri Gribenko```c++ 321ab31d003SDmitri Gribenkovoid PrintAbs(int x) { 322ab31d003SDmitri Gribenko int result; 323ab31d003SDmitri Gribenko if (x >= 0) { 324ab31d003SDmitri Gribenko result = x; // result is {x} 325ab31d003SDmitri Gribenko } else { 326ab31d003SDmitri Gribenko result = -x; // result is {-x} 327ab31d003SDmitri Gribenko } 328ab31d003SDmitri Gribenko print(result); // result is {x; -x} 329ab31d003SDmitri Gribenko} 330ab31d003SDmitri Gribenko``` 331ab31d003SDmitri Gribenko 332ab31d003SDmitri GribenkoWe can't say what specific value gets printed, but we know that it is either `x` 333ab31d003SDmitri Gribenkoor `-x`. 334ab31d003SDmitri Gribenko 335*cc8237d9SAyushi ShuklaDataflow analysis is an instance of abstract interpretation, and does not dictate 336ab31d003SDmitri Gribenkohow exactly the lattice and transfer functions should be designed, beyond the 337ab31d003SDmitri Gribenkonecessary conditions for the analysis to converge. Nevertheless, we can use 338ab31d003SDmitri Gribenkosymbolic execution ideas to guide our design of the lattice and transfer 339ab31d003SDmitri Gribenkofunctions: lattice values can be symbolic expressions, and transfer functions 340ab31d003SDmitri Gribenkocan construct more complex symbolic expressions from symbolic expressions that 341ab31d003SDmitri Gribenkorepresent arguments. See [this StackOverflow 342ab31d003SDmitri Gribenkodiscussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation) 343ab31d003SDmitri Gribenkofor a further comparison of abstract interpretation and symbolic execution. 344ab31d003SDmitri Gribenko 345ab31d003SDmitri Gribenko### Flow condition 346ab31d003SDmitri Gribenko 347ab31d003SDmitri GribenkoA human can say about the previous example that the function returns `x` when 348ab31d003SDmitri Gribenko`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by 349ab31d003SDmitri Gribenkotracking a flow condition. A flow condition is a predicate written in terms of 350ab31d003SDmitri Gribenkothe program state that is true at a specific program point regardless of the 351ab31d003SDmitri Gribenkoexecution path that led to this statement. For example, the flow condition for 352ab31d003SDmitri Gribenkothe program point right before evaluating `result = x` is `x >= 0`. 353ab31d003SDmitri Gribenko 354ab31d003SDmitri GribenkoIf we enhance the lattice to be a set of pairs of values and predicates, the 355ab31d003SDmitri Gribenkodataflow analysis computes the following values: 356ab31d003SDmitri Gribenko 357ab31d003SDmitri Gribenko```c++ 358ab31d003SDmitri Gribenkovoid PrintAbs(int x) { 359ab31d003SDmitri Gribenko int result; 360ab31d003SDmitri Gribenko if (x >= 0) { 361ab31d003SDmitri Gribenko // Flow condition: x >= 0. 362ab31d003SDmitri Gribenko result = x; // result is {x if x >= 0} 363ab31d003SDmitri Gribenko } else { 364ab31d003SDmitri Gribenko // Flow condition: x < 0. 365ab31d003SDmitri Gribenko result = -x; // result is {-x if x < 0} 366ab31d003SDmitri Gribenko } 367ab31d003SDmitri Gribenko print(result); // result is {x if x >= 0; -x if x < 0} 368ab31d003SDmitri Gribenko} 369ab31d003SDmitri Gribenko``` 370ab31d003SDmitri Gribenko 371ab31d003SDmitri GribenkoOf course, in a program with loops, symbolic expressions for flow conditions can 372ab31d003SDmitri Gribenkogrow unbounded. A practical static analysis system must control this growth to 373ab31d003SDmitri Gribenkokeep the symbolic representations manageable and ensure that the data flow 374ab31d003SDmitri Gribenkoanalysis terminates. For example, it can use a constraint solver to prune 375ab31d003SDmitri Gribenkoimpossible flow conditions, and/or it can abstract them, losing precision, after 376ab31d003SDmitri Gribenkotheir symbolic representations grow beyond some threshold. This is similar to 377ab31d003SDmitri Gribenkohow we had to limit the sizes of computed sets of possible values to 3 elements. 378ab31d003SDmitri Gribenko 379ab31d003SDmitri Gribenko### Symbolic pointers 380ab31d003SDmitri Gribenko 381ab31d003SDmitri GribenkoThis approach proves to be particularly useful for modeling pointer values, 382ab31d003SDmitri Gribenkosince we don't care about specific addresses but just want to give a unique 383ab31d003SDmitri Gribenkoidentifier to a memory location. 384ab31d003SDmitri Gribenko 385ab31d003SDmitri Gribenko```c++ 386ab31d003SDmitri Gribenkovoid ExampleOfSymbolicPointers(bool b) { 387ab31d003SDmitri Gribenko int x = 0; // x is {0} 388ab31d003SDmitri Gribenko int* ptr = &x; // x is {0} ptr is {&x} 389ab31d003SDmitri Gribenko if (b) { 390ab31d003SDmitri Gribenko *ptr = 42; // x is {42} ptr is {&x} 391ab31d003SDmitri Gribenko } 392ab31d003SDmitri Gribenko print(x); // x is {0; 42} ptr is {&x} 393ab31d003SDmitri Gribenko} 394ab31d003SDmitri Gribenko``` 395ab31d003SDmitri Gribenko 396ab31d003SDmitri Gribenko## Example: finding output parameters 397ab31d003SDmitri Gribenko 398ab31d003SDmitri GribenkoLet's explore how data flow analysis can help with a problem that is hard to 399ab31d003SDmitri Gribenkosolve with other tools in Clang. 400ab31d003SDmitri Gribenko 401ab31d003SDmitri Gribenko### Problem description 402ab31d003SDmitri Gribenko 403ab31d003SDmitri GribenkoOutput parameters are function parameters of pointer or reference type whose 404ab31d003SDmitri Gribenkopointee is completely overwritten by the function, and not read before it is 405ab31d003SDmitri Gribenkooverwritten. They are common in pre-C++11 code due to the absence of move 406ab31d003SDmitri Gribenkosemantics. In modern C++ output parameters are non-idiomatic, and return values 407ab31d003SDmitri Gribenkoare used instead. 408ab31d003SDmitri Gribenko 409ab31d003SDmitri GribenkoImagine that we would like to refactor output parameters to return values to 410ab31d003SDmitri Gribenkomodernize old code. The first step is to identify refactoring candidates through 411ab31d003SDmitri Gribenkostatic analysis. 412ab31d003SDmitri Gribenko 413ab31d003SDmitri GribenkoFor example, in the following code snippet the pointer `c` is an output 414ab31d003SDmitri Gribenkoparameter: 415ab31d003SDmitri Gribenko 416ab31d003SDmitri Gribenko```c++ 417ab31d003SDmitri Gribenkostruct Customer { 418ab31d003SDmitri Gribenko int account_id; 419ab31d003SDmitri Gribenko std::string name; 420ab31d003SDmitri Gribenko} 421ab31d003SDmitri Gribenko 422ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) { 423ab31d003SDmitri Gribenko c->account_id = ...; 424ab31d003SDmitri Gribenko if (...) { 425ab31d003SDmitri Gribenko c->name = ...; 426ab31d003SDmitri Gribenko } else { 427ab31d003SDmitri Gribenko c->name = ...; 428ab31d003SDmitri Gribenko } 429ab31d003SDmitri Gribenko} 430ab31d003SDmitri Gribenko``` 431ab31d003SDmitri Gribenko 432ab31d003SDmitri GribenkoWe would like to refactor this code into: 433ab31d003SDmitri Gribenko 434ab31d003SDmitri Gribenko```c++ 435ab31d003SDmitri GribenkoCustomer GetCustomer() { 436ab31d003SDmitri Gribenko Customer c; 437ab31d003SDmitri Gribenko c.account_id = ...; 438ab31d003SDmitri Gribenko if (...) { 439ab31d003SDmitri Gribenko c.name = ...; 440ab31d003SDmitri Gribenko } else { 441ab31d003SDmitri Gribenko c.name = ...; 442ab31d003SDmitri Gribenko } 443ab31d003SDmitri Gribenko return c; 444ab31d003SDmitri Gribenko} 445ab31d003SDmitri Gribenko``` 446ab31d003SDmitri Gribenko 447ab31d003SDmitri GribenkoHowever, in the function below the parameter `c` is not an output parameter 448ab31d003SDmitri Gribenkobecause its field `name` is not overwritten on every path through the function. 449ab31d003SDmitri Gribenko 450ab31d003SDmitri Gribenko```c++ 451ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) { 452ab31d003SDmitri Gribenko c->account_id = ...; 453ab31d003SDmitri Gribenko if (...) { 454ab31d003SDmitri Gribenko c->name = ...; 455ab31d003SDmitri Gribenko } 456ab31d003SDmitri Gribenko} 457ab31d003SDmitri Gribenko``` 458ab31d003SDmitri Gribenko 459ab31d003SDmitri GribenkoThe code also cannot read the value of the parameter before overwriting it: 460ab31d003SDmitri Gribenko 461ab31d003SDmitri Gribenko```c++ 462ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) { 463ab31d003SDmitri Gribenko use(c->account_id); 464ab31d003SDmitri Gribenko c->name = ...; 465ab31d003SDmitri Gribenko c->account_id = ...; 466ab31d003SDmitri Gribenko} 467ab31d003SDmitri Gribenko``` 468ab31d003SDmitri Gribenko 469ab31d003SDmitri GribenkoFunctions that escape the pointer also block the refactoring: 470ab31d003SDmitri Gribenko 471ab31d003SDmitri Gribenko```c++ 472ab31d003SDmitri GribenkoCustomer* kGlobalCustomer; 473ab31d003SDmitri Gribenko 474ab31d003SDmitri Gribenkovoid GetCustomer(Customer *c) { 475ab31d003SDmitri Gribenko c->name = ...; 476ab31d003SDmitri Gribenko c->account_id = ...; 477ab31d003SDmitri Gribenko kGlobalCustomer = c; 478ab31d003SDmitri Gribenko} 479ab31d003SDmitri Gribenko``` 480ab31d003SDmitri Gribenko 481ab31d003SDmitri GribenkoTo identify a candidate function for refactoring, we need to do the following: 482ab31d003SDmitri Gribenko 483ab31d003SDmitri Gribenko* Find a function with a non-const pointer or reference parameter. 484ab31d003SDmitri Gribenko 485ab31d003SDmitri Gribenko* Find the definition of that function. 486ab31d003SDmitri Gribenko 487ab31d003SDmitri Gribenko* Prove that the function completely overwrites the pointee on all paths 488ab31d003SDmitri Gribenko before returning. 489ab31d003SDmitri Gribenko 490ab31d003SDmitri Gribenko* Prove that the function reads the pointee only after overwriting it. 491ab31d003SDmitri Gribenko 492ab31d003SDmitri Gribenko* Prove that the function does not persist the pointer in a data structure 493ab31d003SDmitri Gribenko that is live after the function returns. 494ab31d003SDmitri Gribenko 495ab31d003SDmitri GribenkoThere are also requirements that all usage sites of the candidate function must 496ab31d003SDmitri Gribenkosatisfy, for example, that function arguments do not alias, that users are not 497ab31d003SDmitri Gribenkotaking the address of the function, and so on. Let's consider verifying usage 498ab31d003SDmitri Gribenkosite conditions to be a separate static analysis problem. 499ab31d003SDmitri Gribenko 500ab31d003SDmitri Gribenko### Lattice design 501ab31d003SDmitri Gribenko 502ab31d003SDmitri GribenkoTo analyze the function body we can use a lattice which consists of normal 503ab31d003SDmitri Gribenkostates and failure states. A normal state describes program points where we are 504ab31d003SDmitri Gribenkosure that no behaviors that block the refactoring have occurred. Normal states 505ab31d003SDmitri Gribenkokeep track of all parameter's member fields that are known to be overwritten on 506ab31d003SDmitri Gribenkoevery path from function entry to the corresponding program point. Failure 507ab31d003SDmitri Gribenkostates accumulate observed violations (unsafe reads and pointer escapes) that 508ab31d003SDmitri Gribenkoblock the refactoring. 509ab31d003SDmitri Gribenko 510ab31d003SDmitri GribenkoIn the partial order of the lattice failure states compare greater than normal 511ab31d003SDmitri Gribenkostates, which guarantees that they "win" when joined with normal states. Order 512ab31d003SDmitri Gribenkobetween failure states is determined by inclusion relation on the set of 513ab31d003SDmitri Gribenkoaccumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order 514ab31d003SDmitri Gribenkobetween normal states is determined by reversed inclusion relation on the set of 515ab31d003SDmitri Gribenkooverwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of 516ab31d003SDmitri Gribenkooverwritten fields). 517ab31d003SDmitri Gribenko 518ab31d003SDmitri Gribenko 519ab31d003SDmitri Gribenko 520ab31d003SDmitri GribenkoTo determine whether a statement reads or writes a field we can implement 521ab31d003SDmitri Gribenkosymbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer 522ab31d003SDmitri Gribenkodereference operator and `MemberExpr`s. 523ab31d003SDmitri Gribenko 524ab31d003SDmitri Gribenko### Using data flow results to identify output parameters 525ab31d003SDmitri Gribenko 526ab31d003SDmitri GribenkoLet's take a look at how we use data flow analysis to identify an output 527ab31d003SDmitri Gribenkoparameter. The refactoring can be safely done when the data flow algorithm 528ab31d003SDmitri Gribenkocomputes a normal state with all of the fields proven to be overwritten in the 529ab31d003SDmitri Gribenkoexit basic block of the function. 530ab31d003SDmitri Gribenko 531ab31d003SDmitri Gribenko```c++ 532ab31d003SDmitri Gribenkostruct Customer { 533ab31d003SDmitri Gribenko int account_id; 534ab31d003SDmitri Gribenko std::string name; 535ab31d003SDmitri Gribenko}; 536ab31d003SDmitri Gribenko 537ab31d003SDmitri Gribenkovoid GetCustomer(Customer* c) { 538ab31d003SDmitri Gribenko // Overwritten: {} 539ab31d003SDmitri Gribenko c->account_id = ...; // Overwritten: {c->account_id} 540ab31d003SDmitri Gribenko if (...) { 541ab31d003SDmitri Gribenko c->name = ...; // Overwritten: {c->account_id, c->name} 542ab31d003SDmitri Gribenko } else { 543ab31d003SDmitri Gribenko c->name = ...; // Overwritten: {c->account_id, c->name} 544ab31d003SDmitri Gribenko } 545ab31d003SDmitri Gribenko // Overwritten: {c->account_id, c->name} 546ab31d003SDmitri Gribenko} 547ab31d003SDmitri Gribenko``` 548ab31d003SDmitri Gribenko 549ab31d003SDmitri GribenkoWhen the data flow algorithm computes a normal state, but not all fields are 550ab31d003SDmitri Gribenkoproven to be overwritten we can't perform the refactoring. 551ab31d003SDmitri Gribenko 552ab31d003SDmitri Gribenko```c++ 553ab31d003SDmitri Gribenkovoid target(bool b, Customer* c) { 554ab31d003SDmitri Gribenko // Overwritten: {} 555ab31d003SDmitri Gribenko if (b) { 556ab31d003SDmitri Gribenko c->account_id = 42; // Overwritten: {c->account_id} 557ab31d003SDmitri Gribenko } else { 558ab31d003SDmitri Gribenko c->name = "Konrad"; // Overwritten: {c->name} 559ab31d003SDmitri Gribenko } 560ab31d003SDmitri Gribenko // Overwritten: {} 561ab31d003SDmitri Gribenko} 562ab31d003SDmitri Gribenko``` 563ab31d003SDmitri Gribenko 564ab31d003SDmitri GribenkoSimilarly, when the data flow algorithm computes a failure state, we also can't 565ab31d003SDmitri Gribenkoperform the refactoring. 566ab31d003SDmitri Gribenko 567ab31d003SDmitri Gribenko```c++ 568ab31d003SDmitri GribenkoCustomer* kGlobalCustomer; 569ab31d003SDmitri Gribenko 570ab31d003SDmitri Gribenkovoid GetCustomer(Customer* c) { 571ab31d003SDmitri Gribenko // Overwritten: {} 572ab31d003SDmitri Gribenko c->account_id = ...; // Overwritten: {c->account_id} 573ab31d003SDmitri Gribenko if (...) { 574ab31d003SDmitri Gribenko print(c->name); // Unsafe read 575ab31d003SDmitri Gribenko } else { 576ab31d003SDmitri Gribenko kGlobalCustomer = c; // Pointer escape 577ab31d003SDmitri Gribenko } 578ab31d003SDmitri Gribenko // Unsafe read, Pointer escape 579ab31d003SDmitri Gribenko} 580ab31d003SDmitri Gribenko``` 581ab31d003SDmitri Gribenko 582ab31d003SDmitri Gribenko## Example: finding dead stores 583ab31d003SDmitri Gribenko 584ab31d003SDmitri GribenkoLet's say we want to find redundant stores, because they indicate potential 585ab31d003SDmitri Gribenkobugs. 586ab31d003SDmitri Gribenko 587ab31d003SDmitri Gribenko```c++ 588ab31d003SDmitri Gribenkox = GetX(); 589ab31d003SDmitri Gribenkox = GetY(); 590ab31d003SDmitri Gribenko``` 591ab31d003SDmitri Gribenko 592ab31d003SDmitri GribenkoThe first store to `x` is never read, probably there is a bug. 593ab31d003SDmitri Gribenko 594ab31d003SDmitri GribenkoThe implementation of dead store analysis is very similar to output parameter 595ab31d003SDmitri Gribenkoanalysis: we need to track stores and loads, and find stores that were never 596ab31d003SDmitri Gribenkoread. 597ab31d003SDmitri Gribenko 598ab31d003SDmitri Gribenko[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a 599ab31d003SDmitri Gribenkogeneralization of this idea, which is often used to answer many related 600ab31d003SDmitri Gribenkoquestions, for example: 601ab31d003SDmitri Gribenko 602ab31d003SDmitri Gribenko* finding dead stores, 603ab31d003SDmitri Gribenko* finding uninitialized variables, 604ab31d003SDmitri Gribenko* finding a good point to deallocate memory, 605ab31d003SDmitri Gribenko* finding out if it would be safe to move an object. 606ab31d003SDmitri Gribenko 607ab31d003SDmitri Gribenko## Example: definitive initialization 608ab31d003SDmitri Gribenko 609ab31d003SDmitri GribenkoDefinitive initialization proves that variables are known to be initialized when 610ab31d003SDmitri Gribenkoread. If we find a variable which is read when not initialized then we generate 611ab31d003SDmitri Gribenkoa warning. 612ab31d003SDmitri Gribenko 613ab31d003SDmitri Gribenko```c++ 614ab31d003SDmitri Gribenkovoid Init() { 615ab31d003SDmitri Gribenko int x; // x is uninitialized 616ab31d003SDmitri Gribenko if (cond()) { 617ab31d003SDmitri Gribenko x = 10; // x is initialized 618ab31d003SDmitri Gribenko } else { 619ab31d003SDmitri Gribenko x = 20; // x is initialized 620ab31d003SDmitri Gribenko } 621ab31d003SDmitri Gribenko print(x); // x is initialized 622ab31d003SDmitri Gribenko} 623ab31d003SDmitri Gribenko``` 624ab31d003SDmitri Gribenko 625ab31d003SDmitri Gribenko```c++ 626ab31d003SDmitri Gribenkovoid Uninit() { 627ab31d003SDmitri Gribenko int x; // x is uninitialized 628ab31d003SDmitri Gribenko if (cond()) { 629ab31d003SDmitri Gribenko x = 10; // x is initialized 630ab31d003SDmitri Gribenko } 631ab31d003SDmitri Gribenko print(x); // x is maybe uninitialized, x is being read, report a bug. 632ab31d003SDmitri Gribenko} 633ab31d003SDmitri Gribenko``` 634ab31d003SDmitri Gribenko 635ab31d003SDmitri GribenkoFor this purpose we can use lattice in a form of a mapping from variable 636ab31d003SDmitri Gribenkodeclarations to initialization states; each initialization state is represented 637*cc8237d9SAyushi Shuklaby the following lattice: 638ab31d003SDmitri Gribenko 639ab31d003SDmitri Gribenko 640ab31d003SDmitri Gribenko 641ab31d003SDmitri GribenkoA lattice element could also capture the source locations of the branches that 642ab31d003SDmitri Gribenkolead us to the corresponding program point. Diagnostics would use this 643ab31d003SDmitri Gribenkoinformation to show a sample buggy code path to the user. 644ab31d003SDmitri Gribenko 645ab31d003SDmitri Gribenko## Example: refactoring raw pointers to `unique_ptr` 646ab31d003SDmitri Gribenko 647ab31d003SDmitri GribenkoModern idiomatic C++ uses smart pointers to express memory ownership, however in 648ab31d003SDmitri Gribenkopre-C++11 code one can often find raw pointers that own heap memory blocks. 649ab31d003SDmitri Gribenko 650ab31d003SDmitri GribenkoImagine that we would like to refactor raw pointers that own memory to 651ab31d003SDmitri Gribenko`unique_ptr`. There are multiple ways to design a data flow analysis for this 652ab31d003SDmitri Gribenkoproblem; let's look at one way to do it. 653ab31d003SDmitri Gribenko 654ab31d003SDmitri GribenkoFor example, we would like to refactor the following code that uses raw 655ab31d003SDmitri Gribenkopointers: 656ab31d003SDmitri Gribenko 657ab31d003SDmitri Gribenko```c++ 658ab31d003SDmitri Gribenkovoid UniqueOwnership1() { 659ab31d003SDmitri Gribenko int *pi = new int; 660ab31d003SDmitri Gribenko if (...) { 661ab31d003SDmitri Gribenko Borrow(pi); 662ab31d003SDmitri Gribenko delete pi; 663ab31d003SDmitri Gribenko } else { 664ab31d003SDmitri Gribenko TakeOwnership(pi); 665ab31d003SDmitri Gribenko } 666ab31d003SDmitri Gribenko} 667ab31d003SDmitri Gribenko``` 668ab31d003SDmitri Gribenko 669ab31d003SDmitri Gribenkointo code that uses `unique_ptr`: 670ab31d003SDmitri Gribenko 671ab31d003SDmitri Gribenko```c++ 672ab31d003SDmitri Gribenkovoid UniqueOwnership1() { 673ab31d003SDmitri Gribenko auto pi = std::make_unique<int>(); 674ab31d003SDmitri Gribenko if (...) { 675ab31d003SDmitri Gribenko Borrow(pi.get()); 676ab31d003SDmitri Gribenko } else { 677ab31d003SDmitri Gribenko TakeOwnership(pi.release()); 678ab31d003SDmitri Gribenko } 679ab31d003SDmitri Gribenko} 680ab31d003SDmitri Gribenko``` 681ab31d003SDmitri Gribenko 682ab31d003SDmitri GribenkoThis problem can be solved with a lattice in form of map from value declarations 683ab31d003SDmitri Gribenkoto pointer states: 684ab31d003SDmitri Gribenko 685ab31d003SDmitri Gribenko 686ab31d003SDmitri Gribenko 687ab31d003SDmitri GribenkoWe can perform the refactoring if at the exit of a function `pi` is 688ab31d003SDmitri Gribenko`Compatible`. 689ab31d003SDmitri Gribenko 690ab31d003SDmitri Gribenko```c++ 691ab31d003SDmitri Gribenkovoid UniqueOwnership1() { 692ab31d003SDmitri Gribenko int *pi; // pi is Compatible 693ab31d003SDmitri Gribenko pi = new int; // pi is Defined 694ab31d003SDmitri Gribenko if (...) { 695ab31d003SDmitri Gribenko Borrow(pi); // pi is Defined 696ab31d003SDmitri Gribenko delete pi; // pi is Compatible 697ab31d003SDmitri Gribenko } else { 698ab31d003SDmitri Gribenko TakeOwnership(pi); // pi is Compatible 699ab31d003SDmitri Gribenko } 700ab31d003SDmitri Gribenko // pi is Compatible 701ab31d003SDmitri Gribenko} 702ab31d003SDmitri Gribenko``` 703ab31d003SDmitri Gribenko 704ab31d003SDmitri GribenkoLet's look at an example where the raw pointer owns two different memory blocks: 705ab31d003SDmitri Gribenko 706ab31d003SDmitri Gribenko```c++ 707ab31d003SDmitri Gribenkovoid UniqueOwnership2() { 708ab31d003SDmitri Gribenko int *pi = new int; // pi is Defined 709ab31d003SDmitri Gribenko Borrow(pi); 710ab31d003SDmitri Gribenko delete pi; // pi is Compatible 711ab31d003SDmitri Gribenko if (smth) { 712ab31d003SDmitri Gribenko pi = new int; // pi is Defined 713ab31d003SDmitri Gribenko Borrow(pi); 714ab31d003SDmitri Gribenko delete pi; // pi is Compatible 715ab31d003SDmitri Gribenko } 716ab31d003SDmitri Gribenko // pi is Compatible 717ab31d003SDmitri Gribenko} 718ab31d003SDmitri Gribenko``` 719ab31d003SDmitri Gribenko 720ab31d003SDmitri GribenkoIt can be refactored to use `unique_ptr` like this: 721ab31d003SDmitri Gribenko 722ab31d003SDmitri Gribenko```c++ 723ab31d003SDmitri Gribenkovoid UniqueOwnership2() { 724ab31d003SDmitri Gribenko auto pi = make_unique<int>(); 725ab31d003SDmitri Gribenko Borrow(pi); 726ab31d003SDmitri Gribenko if (smth) { 727ab31d003SDmitri Gribenko pi = make_unique<int>(); 728ab31d003SDmitri Gribenko Borrow(pi); 729ab31d003SDmitri Gribenko } 730ab31d003SDmitri Gribenko} 731ab31d003SDmitri Gribenko``` 732ab31d003SDmitri Gribenko 733ab31d003SDmitri GribenkoIn the following example, the raw pointer is used to access the heap object 734ab31d003SDmitri Gribenkoafter the ownership has been transferred. 735ab31d003SDmitri Gribenko 736ab31d003SDmitri Gribenko```c++ 737ab31d003SDmitri Gribenkovoid UniqueOwnership3() { 738ab31d003SDmitri Gribenko int *pi = new int; // pi is Defined 739ab31d003SDmitri Gribenko if (...) { 740ab31d003SDmitri Gribenko Borrow(pi); 741ab31d003SDmitri Gribenko delete pi; // pi is Compatible 742ab31d003SDmitri Gribenko } else { 743ab31d003SDmitri Gribenko vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible 744ab31d003SDmitri Gribenko print(*pi); 745ab31d003SDmitri Gribenko use(v); 746ab31d003SDmitri Gribenko } 747ab31d003SDmitri Gribenko // pi is Compatible 748ab31d003SDmitri Gribenko} 749ab31d003SDmitri Gribenko``` 750ab31d003SDmitri Gribenko 751ab31d003SDmitri GribenkoWe can refactor this code to use `unique_ptr`, however we would have to 752ab31d003SDmitri Gribenkointroduce a non-owning pointer variable, since we can't use the moved-from 753ab31d003SDmitri Gribenko`unique_ptr` to access the object: 754ab31d003SDmitri Gribenko 755ab31d003SDmitri Gribenko```c++ 756ab31d003SDmitri Gribenkovoid UniqueOwnership3() { 757ab31d003SDmitri Gribenko std::unique_ptr<int> pi = std::make_unique<int>(); 758ab31d003SDmitri Gribenko if (...) { 759ab31d003SDmitri Gribenko Borrow(pi); 760ab31d003SDmitri Gribenko } else { 761ab31d003SDmitri Gribenko int *pi_non_owning = pi.get(); 762ab31d003SDmitri Gribenko vector<unique_ptr<int>> v = {std::move(pi)}; 763ab31d003SDmitri Gribenko print(*pi_non_owning); 764ab31d003SDmitri Gribenko use(v); 765ab31d003SDmitri Gribenko } 766ab31d003SDmitri Gribenko} 767ab31d003SDmitri Gribenko``` 768ab31d003SDmitri Gribenko 769ab31d003SDmitri GribenkoIf the original code didn't call `delete` at the very end of the function, then 770ab31d003SDmitri Gribenkoour refactoring may change the point at which we run the destructor and release 771ab31d003SDmitri Gribenkomemory. Specifically, if there is some user code after `delete`, then extending 772ab31d003SDmitri Gribenkothe lifetime of the object until the end of the function may hold locks for 773ab31d003SDmitri Gribenkolonger than necessary, introduce memory overhead etc. 774ab31d003SDmitri Gribenko 775ab31d003SDmitri GribenkoOne solution is to always replace `delete` with a call to `reset()`, and then 776ab31d003SDmitri Gribenkoperform another analysis that removes unnecessary `reset()` calls. 777ab31d003SDmitri Gribenko 778ab31d003SDmitri Gribenko```c++ 779ab31d003SDmitri Gribenkovoid AddedMemoryOverhead() { 780ab31d003SDmitri Gribenko HugeObject *ho = new HugeObject(); 781ab31d003SDmitri Gribenko use(ho); 782ab31d003SDmitri Gribenko delete ho; // Release the large amount of memory quickly. 783ab31d003SDmitri Gribenko LongRunningFunction(); 784ab31d003SDmitri Gribenko} 785ab31d003SDmitri Gribenko``` 786ab31d003SDmitri Gribenko 787ab31d003SDmitri GribenkoThis analysis will refuse to refactor code that mixes borrowed pointer values 788ab31d003SDmitri Gribenkoand unique ownership. In the following code, `GetPtr()` returns a borrowed 789ab31d003SDmitri Gribenkopointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned 790ab31d003SDmitri Gribenkopointer. We don't distinguish between these two assignments, and we want each 791ab31d003SDmitri Gribenkoassignment to be paired with a corresponding sink; otherwise, we transition the 792ab31d003SDmitri Gribenkopointer to a `Conflicting` state, like in this example. 793ab31d003SDmitri Gribenko 794ab31d003SDmitri Gribenko```c++ 795ab31d003SDmitri Gribenkovoid ConflictingOwnership() { 796ab31d003SDmitri Gribenko int *pi; // pi is Compatible 797ab31d003SDmitri Gribenko pi = GetPtr(); // pi is Defined 798ab31d003SDmitri Gribenko Borrow(pi); // pi is Defined 799ab31d003SDmitri Gribenko 800ab31d003SDmitri Gribenko pi = new int; // pi is Conflicting 801ab31d003SDmitri Gribenko Borrow(pi); 802ab31d003SDmitri Gribenko delete pi; 803ab31d003SDmitri Gribenko // pi is Conflicting 804ab31d003SDmitri Gribenko} 805ab31d003SDmitri Gribenko``` 806ab31d003SDmitri Gribenko 807ab31d003SDmitri GribenkoWe could still handle this case by finding a maximal range in the code where 808ab31d003SDmitri Gribenko`pi` could be in the Compatible state, and only refactoring that part. 809ab31d003SDmitri Gribenko 810ab31d003SDmitri Gribenko```c++ 811ab31d003SDmitri Gribenkovoid ConflictingOwnership() { 812ab31d003SDmitri Gribenko int *pi; 813ab31d003SDmitri Gribenko pi = GetPtr(); 814ab31d003SDmitri Gribenko Borrow(pi); 815ab31d003SDmitri Gribenko 816ab31d003SDmitri Gribenko std::unique_ptr<int> pi_unique = std::make_unique<int>(); 817ab31d003SDmitri Gribenko Borrow(pi_unique.get()); 818ab31d003SDmitri Gribenko} 819ab31d003SDmitri Gribenko``` 820ab31d003SDmitri Gribenko 821ab31d003SDmitri Gribenko## Example: finding redundant branch conditions 822ab31d003SDmitri Gribenko 823ab31d003SDmitri GribenkoIn the code below `b1` should not be checked in both the outer and inner "if" 824ab31d003SDmitri Gribenkostatements. It is likely there is a bug in this code. 825ab31d003SDmitri Gribenko 826ab31d003SDmitri Gribenko```c++ 827ab31d003SDmitri Gribenkoint F(bool b1, bool b2) { 828ab31d003SDmitri Gribenko if (b1) { 829ab31d003SDmitri Gribenko f(); 830ab31d003SDmitri Gribenko if (b1 && b2) { // Check `b1` again -- unnecessary! 831ab31d003SDmitri Gribenko g(); 832ab31d003SDmitri Gribenko } 833ab31d003SDmitri Gribenko } 834ab31d003SDmitri Gribenko} 835ab31d003SDmitri Gribenko``` 836ab31d003SDmitri Gribenko 837ab31d003SDmitri GribenkoA checker that finds this pattern syntactically is already implemented in 838ab31d003SDmitri GribenkoClangTidy using AST matchers (`bugprone-redundant-branch-condition`). 839ab31d003SDmitri Gribenko 840ab31d003SDmitri GribenkoTo implement it using the data flow analysis framework, we can produce a warning 841ab31d003SDmitri Gribenkoif any part of the branch condition is implied by the flow condition. 842ab31d003SDmitri Gribenko 843ab31d003SDmitri Gribenko```c++ 844ab31d003SDmitri Gribenkoint F(bool b1, bool b2) { 845ab31d003SDmitri Gribenko // Flow condition: true. 846ab31d003SDmitri Gribenko if (b1) { 847ab31d003SDmitri Gribenko // Flow condition: b1. 848ab31d003SDmitri Gribenko f(); 849ab31d003SDmitri Gribenko if (b1 && b2) { // `b1` is implied by the flow condition. 850ab31d003SDmitri Gribenko g(); 851ab31d003SDmitri Gribenko } 852ab31d003SDmitri Gribenko } 853ab31d003SDmitri Gribenko} 854ab31d003SDmitri Gribenko``` 855ab31d003SDmitri Gribenko 856ab31d003SDmitri GribenkoOne way to check this implication is to use a SAT solver. Without a SAT solver, 857ab31d003SDmitri Gribenkowe could keep the flow condition in the CNF form and then it would be easy to 858ab31d003SDmitri Gribenkocheck the implication. 859ab31d003SDmitri Gribenko 860ab31d003SDmitri Gribenko## Example: finding unchecked `std::optional` unwraps 861ab31d003SDmitri Gribenko 862ab31d003SDmitri GribenkoCalling `optional::value()` is only valid if `optional::has_value()` is true. We 863ab31d003SDmitri Gribenkowant to show that when `x.value()` is executed, the flow condition implies 864ab31d003SDmitri Gribenko`x.has_value()`. 865ab31d003SDmitri Gribenko 866ab31d003SDmitri GribenkoIn the example below `x.value()` is accessed safely because it is guarded by the 867ab31d003SDmitri Gribenko`x.has_value()` check. 868ab31d003SDmitri Gribenko 869ab31d003SDmitri Gribenko```c++ 870ab31d003SDmitri Gribenkovoid Example(std::optional<int> &x) { 871ab31d003SDmitri Gribenko if (x.has_value()) { 872ab31d003SDmitri Gribenko use(x.value()); 873ab31d003SDmitri Gribenko } 874ab31d003SDmitri Gribenko} 875ab31d003SDmitri Gribenko``` 876ab31d003SDmitri Gribenko 877ab31d003SDmitri GribenkoWhile entering the if branch we deduce that `x.has_value()` is implied by the 878ab31d003SDmitri Gribenkoflow condition. 879ab31d003SDmitri Gribenko 880ab31d003SDmitri Gribenko```c++ 881ab31d003SDmitri Gribenkovoid Example(std::optional<int> x) { 882ab31d003SDmitri Gribenko // Flow condition: true. 883ab31d003SDmitri Gribenko if (x.has_value()) { 884ab31d003SDmitri Gribenko // Flow condition: x.has_value() == true. 885ab31d003SDmitri Gribenko use(x.value()); 886ab31d003SDmitri Gribenko } 887ab31d003SDmitri Gribenko // Flow condition: true. 888ab31d003SDmitri Gribenko} 889ab31d003SDmitri Gribenko``` 890ab31d003SDmitri Gribenko 891ab31d003SDmitri GribenkoWe also need to prove that `x` is not modified between check and value access. 892ab31d003SDmitri GribenkoThe modification of `x` may be very subtle: 893ab31d003SDmitri Gribenko 894ab31d003SDmitri Gribenko```c++ 895ab31d003SDmitri Gribenkovoid F(std::optional<int> &x); 896ab31d003SDmitri Gribenko 897ab31d003SDmitri Gribenkovoid Example(std::optional<int> &x) { 898ab31d003SDmitri Gribenko if (x.has_value()) { 899ab31d003SDmitri Gribenko // Flow condition: x.has_value() == true. 900ab31d003SDmitri Gribenko unknown_function(x); // may change x. 901ab31d003SDmitri Gribenko // Flow condition: true. 902ab31d003SDmitri Gribenko use(x.value()); 903ab31d003SDmitri Gribenko } 904ab31d003SDmitri Gribenko} 905ab31d003SDmitri Gribenko``` 906ab31d003SDmitri Gribenko 907ab31d003SDmitri Gribenko## Example: finding dead code behind A/B experiment flags 908ab31d003SDmitri Gribenko 909ab31d003SDmitri GribenkoFinding dead code is a classic application of data flow analysis. 910ab31d003SDmitri Gribenko 911ab31d003SDmitri GribenkoUnused flags for A/B experiment hide dead code. However, this flavor of dead 912ab31d003SDmitri Gribenkocode is invisible to the compiler because the flag can be turned on at any 913ab31d003SDmitri Gribenkomoment. 914ab31d003SDmitri Gribenko 915ab31d003SDmitri GribenkoWe could make a tool that deletes experiment flags. The user tells us which flag 916ab31d003SDmitri Gribenkothey want to delete, and we assume that the it's value is a given constant. 917ab31d003SDmitri Gribenko 918ab31d003SDmitri GribenkoFor example, the user could use the tool to remove `example_flag` from this 919ab31d003SDmitri Gribenkocode: 920ab31d003SDmitri Gribenko 921ab31d003SDmitri Gribenko```c++ 922ab31d003SDmitri GribenkoDEFINE_FLAG(std::string, example_flag, "", "A sample flag."); 923ab31d003SDmitri Gribenko 924ab31d003SDmitri Gribenkovoid Example() { 925ab31d003SDmitri Gribenko bool x = GetFlag(FLAGS_example_flag).empty(); 926ab31d003SDmitri Gribenko f(); 927ab31d003SDmitri Gribenko if (x) { 928ab31d003SDmitri Gribenko g(); 929ab31d003SDmitri Gribenko } else { 930ab31d003SDmitri Gribenko h(); 931ab31d003SDmitri Gribenko } 932ab31d003SDmitri Gribenko} 933ab31d003SDmitri Gribenko``` 934ab31d003SDmitri Gribenko 935ab31d003SDmitri GribenkoThe tool would simplify the code to: 936ab31d003SDmitri Gribenko 937ab31d003SDmitri Gribenko```c++ 938ab31d003SDmitri Gribenkovoid Example() { 939ab31d003SDmitri Gribenko f(); 940ab31d003SDmitri Gribenko g(); 941ab31d003SDmitri Gribenko} 942ab31d003SDmitri Gribenko``` 943ab31d003SDmitri Gribenko 944ab31d003SDmitri GribenkoWe can solve this problem with a classic constant propagation lattice combined 945ab31d003SDmitri Gribenkowith symbolic evaluation. 946ab31d003SDmitri Gribenko 947ab31d003SDmitri Gribenko## Example: finding inefficient usages of associative containers 948ab31d003SDmitri Gribenko 949ab31d003SDmitri GribenkoReal-world code often accidentally performs repeated lookups in associative 950ab31d003SDmitri Gribenkocontainers: 951ab31d003SDmitri Gribenko 952ab31d003SDmitri Gribenko```c++ 953ab31d003SDmitri Gribenkomap<int, Employee> xs; 954ab31d003SDmitri Gribenkoxs[42]->name = "..."; 955ab31d003SDmitri Gribenkoxs[42]->title = "..."; 956ab31d003SDmitri Gribenko``` 957ab31d003SDmitri Gribenko 958ab31d003SDmitri GribenkoTo find the above inefficiency we can use the available expressions analysis to 959ab31d003SDmitri Gribenkounderstand that `m[42]` is evaluated twice. 960ab31d003SDmitri Gribenko 961ab31d003SDmitri Gribenko```c++ 962ab31d003SDmitri Gribenkomap<int, Employee> xs; 963ab31d003SDmitri GribenkoEmployee &e = xs[42]; 964ab31d003SDmitri Gribenkoe->name = "..."; 965ab31d003SDmitri Gribenkoe->title = "..."; 966ab31d003SDmitri Gribenko``` 967ab31d003SDmitri Gribenko 968ab31d003SDmitri GribenkoWe can also track the `m.contains()` check in the flow condition to find 969ab31d003SDmitri Gribenkoredundant checks, like in the example below. 970ab31d003SDmitri Gribenko 971ab31d003SDmitri Gribenko```c++ 972ab31d003SDmitri Gribenkostd::map<int, Employee> xs; 973ab31d003SDmitri Gribenkoif (!xs.contains(42)) { 974ab31d003SDmitri Gribenko xs.insert({42, someEmployee}); 975ab31d003SDmitri Gribenko} 976ab31d003SDmitri Gribenko``` 977ab31d003SDmitri Gribenko 978ab31d003SDmitri Gribenko## Example: refactoring types that implicitly convert to each other 979ab31d003SDmitri Gribenko 980ab31d003SDmitri GribenkoRefactoring one strong type to another is difficult, but the compiler can help: 981ab31d003SDmitri Gribenkoonce you refactor one reference to the type, the compiler will flag other places 982ab31d003SDmitri Gribenkowhere this information flows with type mismatch errors. Unfortunately this 983ab31d003SDmitri Gribenkostrategy does not work when you are refactoring types that implicitly convert to 984ab31d003SDmitri Gribenkoeach other, for example, replacing `int32_t` with `int64_t`. 985ab31d003SDmitri Gribenko 986ab31d003SDmitri GribenkoImagine that we want to change user IDs from 32 to 64-bit integers. In other 987ab31d003SDmitri Gribenkowords, we need to find all integers tainted with user IDs. We can use data flow 988ab31d003SDmitri Gribenkoanalysis to implement taint analysis. 989ab31d003SDmitri Gribenko 990ab31d003SDmitri Gribenko```c++ 991ab31d003SDmitri Gribenkovoid UseUser(int32_t user_id) { 992ab31d003SDmitri Gribenko int32_t id = user_id; 993ab31d003SDmitri Gribenko // Variable `id` is tainted with a user ID. 994ab31d003SDmitri Gribenko ... 995ab31d003SDmitri Gribenko} 996ab31d003SDmitri Gribenko``` 997ab31d003SDmitri Gribenko 998ab31d003SDmitri GribenkoTaint analysis is very well suited to this problem because the program rarely 999ab31d003SDmitri Gribenkobranches on user IDs, and almost certainly does not perform any computation 1000ab31d003SDmitri Gribenko(like arithmetic). 1001