1# Data flow analysis: an informal introduction 2 3## Abstract 4 5This document introduces data flow analysis in an informal way. The goal is to 6give the reader an intuitive understanding of how it works, and show how it 7applies to a range of refactoring and bug finding problems. 8 9Data flow analysis is a well-established technique; it is described in many 10papers, books, and videos. If you would like a more formal, or a more thorough 11explanation of the concepts mentioned in this document, please refer to the 12following resources: 13 14* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)). 15* Videos on the PacketPrep YouTube channel that introduce lattices and the 16 necessary background information: 17 [#20](https://www.youtube.com/watch?v=73j_FXBXGm8), 18 [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8), 19 [#22](https://www.youtube.com/watch?v=saOG7Uooeho), 20 [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g), 21 [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc), 22 [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw). 23* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE) 24* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf). 25* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf). 26* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/). 27* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf) 28 (a paper that successfully applies symbolic execution to real-world 29 software). 30 31## Data flow analysis 32 33### The purpose of data flow analysis 34 35Data flow analysis is a static analysis technique that proves facts about a 36program or its fragment. It can make conclusions about all paths through the 37program, while taking control flow into account and scaling to large programs. 38The basic idea is propagating facts about the program through the edges of the 39control flow graph (CFG) until a fixpoint is reached. 40 41### Sample problem and an ad-hoc solution 42 43We would like to explain data flow analysis while discussing an example. Let's 44imagine that we want to track possible values of an integer variable in our 45program. Here is how a human could annotate the code: 46 47```c++ 48void Example(int n) { 49 int x = 0; 50 // x is {0} 51 if (n > 0) { 52 x = 5; 53 // x is {5} 54 } else { 55 x = 42; 56 // x is {42} 57 } 58 // x is {5; 42} 59 print(x); 60} 61``` 62 63We use sets of integers to represent possible values of `x`. Local variables 64have unambiguous values between statements, so we annotate program points 65between statements with sets of possible values. 66 67Here is how we arrived at these annotations. Assigning a constant to `x` allows 68us to make a conclusion that `x` can only have one value. When control flow from 69the "then" and "else" branches joins, `x` can have either value. 70 71Abstract algebra provides a nice formalism that models this kind of structure, 72namely, a lattice. A join-semilattice is a partially ordered set, in which every 73two elements have a least upper bound (called a *join*). 74 75``` 76join(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x 77``` 78 79For this problem we will use the lattice of subsets of integers, with set 80inclusion relation as ordering and set union as a join. 81 82Lattices are often represented visually as Hasse diagrams. Here is a Hasse 83diagram for our lattice that tracks subsets of integers: 84 85 86 87Computing the join in the lattice corresponds to finding the lowest common 88ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of 89literature on efficiently implementing LCA queries for a DAG, however Efficient 90Implementation of Lattice Operations (1989) 91([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911), 92[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that 93particularly well-suited for programmatic implementation. 94 95### Too much information and "top" values 96 97Let's try to find the possible sets of values of `x` in a function that modifies 98`x` in a loop: 99 100```c++ 101void ExampleOfInfiniteSets() { 102 int x = 0; // x is {0} 103 while (condition()) { 104 x += 1; // x is {0; 1; 2; …} 105 } 106 print(x); // x is {0; 1; 2; …} 107} 108``` 109 110We have an issue: `x` can have any value greater than zero; that's an infinite 111set of values, if the program operated on mathematical integers. In C++ `int` is 112limited by `INT_MAX` so technically we have a set `{0; 1; …; INT_MAX}` which is 113still really big. 114 115To make our analysis practical to compute, we have to limit the amount of 116information that we track. In this case, we can, for example, arbitrarily limit 117the size of sets to 3 elements. If at a certain program point `x` has more than 1183 possible values, we stop tracking specific values at that program point. 119Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top" 120according to a convention in abstract algebra). 121 122```c++ 123void ExampleOfTopWithALoop() { 124 int x = 0; // x is {0} 125 while (condition()) { 126 x += 1; // x is ⊤ 127 } 128 print(x); // x is ⊤ 129} 130``` 131 132The statement "at this program point, `x`'s possible values are `⊤`" is 133understood as "at this program point `x` can have any value because we have too 134much information, or the information is conflicting". 135 136Note that we can get more than 3 possible values even without a loop: 137 138```c++ 139void ExampleOfTopWithoutLoops(int n) { 140 int x = 0; // x is {0} 141 switch(n) { 142 case 0: x = 1; break; // x is {1} 143 case 1: x = 9; break; // x is {9} 144 case 2: x = 7; break; // x is {7} 145 default: x = 3; break; // x is {3} 146 } 147 // x is ⊤ 148} 149``` 150 151### Uninitialized variables and "bottom" values 152 153When `x` is declared but not initialized, it has no possible values. We 154represent this fact symbolically as `⊥` (pronounced "bottom"). 155 156```c++ 157void ExampleOfBottom() { 158 int x; // x is ⊥ 159 x = 42; // x is {42} 160 print(x); 161} 162``` 163 164Note that using values read from uninitialized variables is undefined behaviour 165in C++. Generally, compilers and static analysis tools can assume undefined 166behavior does not happen. We must model uninitialized variables only when we are 167implementing a checker that specifically is trying to find uninitialized reads. 168In this example we show how to model uninitialized variables only to demonstrate 169the concept of "bottom", and how it applies to possible value analysis. We 170describe an analysis that finds uninitialized reads in a section below. 171 172### A practical lattice that tracks sets of concrete values 173 174Taking into account all corner cases covered above, we can put together a 175lattice that we can use in practice to track possible values of integer 176variables. This lattice represents sets of integers with 1, 2, or 3 elements, as 177well as top and bottom. Here is a Hasse diagram for it: 178 179 180 181### Formalization 182 183Let's consider a slightly more complex example, and think about how we can 184compute the sets of possible values algorithmically. 185 186```c++ 187void Example(int n) { 188 int x; // x is ⊥ 189 if (n > 0) { 190 if (n == 42) { 191 x = 44; // x is {44} 192 } else { 193 x = 5; // x is {5} 194 } 195 print(x); // x is {44; 5} 196 } else { 197 x = n; // x is ⊤ 198 } 199 print(x); // x is ⊤ 200} 201``` 202 203As humans, we understand the control flow from the program text. We used our 204understanding of control flow to find program points where two flows join. 205Formally, control flow is represented by a CFG (control flow graph): 206 207 208 209We can compute sets of possible values by propagating them through the CFG of 210the function: 211 212* When `x` is declared but not initialized, its possible values are `{}`. The 213 empty set plays the role of `⊥` in this lattice. 214 215* When `x` is assigned a concrete value, its possible set of values contains 216 just that specific value. 217 218* When `x` is assigned some unknown value, it can have any value. We represent 219 this fact as `⊤`. 220 221* When two control flow paths join, we compute the set union of incoming 222 values (limiting the number of elements to 3, representing larger sets as 223 `⊤`). 224 225The sets of possible values are influenced by: 226 227* Statements, for example, assignments. 228 229* Joins in control flow, for example, ones that appear at the end of "if" 230 statements. 231 232**Effects of statements** are modeled by what is formally known as a transfer 233function. A transfer function takes two arguments: the statement, and the state 234of `x` at the previous program point. It produces the state of `x` at the next 235program point. For example, the transfer function for assignment ignores the 236state at the previous program point: 237 238```c++ 239// GIVEN: x is {42; 44} 240x = 0; 241// CONCLUSION: x is {0} 242``` 243 244The transfer function for `+` performs arithmetic on every set member: 245 246```c++ 247// GIVEN: x is {42, 44} 248x = x + 100; 249// CONCLUSION: x is {142, 144} 250``` 251 252**Effects of control flow** are modeled by joining the knowledge from all 253possible previous program points. 254 255```c++ 256if (...) { 257 ... 258 // GIVEN: x is {42} 259} else { 260 ... 261 // GIVEN: x is {44} 262} 263// CONCLUSION: x is {42; 44} 264``` 265 266```c++ 267// GIVEN: x is {42} 268while (...) { 269 ... 270 // GIVEN: x is {44} 271} 272// CONCLUSION: {42; 44} 273``` 274 275The predicate that we marked "given" is usually called a precondition, and the 276conclusion is called a postcondition. 277 278In terms of the CFG, we join the information from all predecessor basic blocks. 279 280 281 282Putting it all together, to model the effects of a basic block we compute: 283 284``` 285out = transfer(basic_block, join(in_1, in_2, ..., in_n)) 286``` 287 288(Note that there are other ways to write this equation that produce higher 289precision analysis results. The trick is to keep exploring the execution paths 290separately and delay joining until later. However, we won't discuss those 291variations here.) 292 293To make a conclusion about all paths through the program, we repeat this 294computation on all basic blocks until we reach a fixpoint. In other words, we 295keep propagating information through the CFG until the computed sets of values 296stop changing. 297 298If the lattice has a finite height and transfer functions are monotonic the 299algorithm is guaranteed to terminate. Each iteration of the algorithm can 300change computed values only to larger values from the lattice. In the worst 301case, all computed values become `⊤`, which is not very useful, but at least the 302analysis terminates at that point, because it can't change any of the values. 303 304Fixpoint iteration can be optimised by only reprocessing basic blocks which had 305one of their inputs changed on the previous iteration. This is typically 306implemented using a worklist queue. With this optimisation the time complexity 307becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and 308`|L|` is the size of lattice used by the analysis. 309 310## Symbolic execution: a very short informal introduction 311 312### Symbolic values 313 314In the previous example where we tried to figure out what values a variable can 315have, the analysis had to be seeded with a concrete value. What if there are no 316assignments of concrete values in the program? We can still deduce some 317interesting information by representing unknown input values symbolically, and 318computing results as symbolic expressions: 319 320```c++ 321void PrintAbs(int x) { 322 int result; 323 if (x >= 0) { 324 result = x; // result is {x} 325 } else { 326 result = -x; // result is {-x} 327 } 328 print(result); // result is {x; -x} 329} 330``` 331 332We can't say what specific value gets printed, but we know that it is either `x` 333or `-x`. 334 335Dataflow analysis is an instance of abstract interpretation, and does not dictate 336how exactly the lattice and transfer functions should be designed, beyond the 337necessary conditions for the analysis to converge. Nevertheless, we can use 338symbolic execution ideas to guide our design of the lattice and transfer 339functions: lattice values can be symbolic expressions, and transfer functions 340can construct more complex symbolic expressions from symbolic expressions that 341represent arguments. See [this StackOverflow 342discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation) 343for a further comparison of abstract interpretation and symbolic execution. 344 345### Flow condition 346 347A human can say about the previous example that the function returns `x` when 348`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by 349tracking a flow condition. A flow condition is a predicate written in terms of 350the program state that is true at a specific program point regardless of the 351execution path that led to this statement. For example, the flow condition for 352the program point right before evaluating `result = x` is `x >= 0`. 353 354If we enhance the lattice to be a set of pairs of values and predicates, the 355dataflow analysis computes the following values: 356 357```c++ 358void PrintAbs(int x) { 359 int result; 360 if (x >= 0) { 361 // Flow condition: x >= 0. 362 result = x; // result is {x if x >= 0} 363 } else { 364 // Flow condition: x < 0. 365 result = -x; // result is {-x if x < 0} 366 } 367 print(result); // result is {x if x >= 0; -x if x < 0} 368} 369``` 370 371Of course, in a program with loops, symbolic expressions for flow conditions can 372grow unbounded. A practical static analysis system must control this growth to 373keep the symbolic representations manageable and ensure that the data flow 374analysis terminates. For example, it can use a constraint solver to prune 375impossible flow conditions, and/or it can abstract them, losing precision, after 376their symbolic representations grow beyond some threshold. This is similar to 377how we had to limit the sizes of computed sets of possible values to 3 elements. 378 379### Symbolic pointers 380 381This approach proves to be particularly useful for modeling pointer values, 382since we don't care about specific addresses but just want to give a unique 383identifier to a memory location. 384 385```c++ 386void ExampleOfSymbolicPointers(bool b) { 387 int x = 0; // x is {0} 388 int* ptr = &x; // x is {0} ptr is {&x} 389 if (b) { 390 *ptr = 42; // x is {42} ptr is {&x} 391 } 392 print(x); // x is {0; 42} ptr is {&x} 393} 394``` 395 396## Example: finding output parameters 397 398Let's explore how data flow analysis can help with a problem that is hard to 399solve with other tools in Clang. 400 401### Problem description 402 403Output parameters are function parameters of pointer or reference type whose 404pointee is completely overwritten by the function, and not read before it is 405overwritten. They are common in pre-C++11 code due to the absence of move 406semantics. In modern C++ output parameters are non-idiomatic, and return values 407are used instead. 408 409Imagine that we would like to refactor output parameters to return values to 410modernize old code. The first step is to identify refactoring candidates through 411static analysis. 412 413For example, in the following code snippet the pointer `c` is an output 414parameter: 415 416```c++ 417struct Customer { 418 int account_id; 419 std::string name; 420} 421 422void GetCustomer(Customer *c) { 423 c->account_id = ...; 424 if (...) { 425 c->name = ...; 426 } else { 427 c->name = ...; 428 } 429} 430``` 431 432We would like to refactor this code into: 433 434```c++ 435Customer GetCustomer() { 436 Customer c; 437 c.account_id = ...; 438 if (...) { 439 c.name = ...; 440 } else { 441 c.name = ...; 442 } 443 return c; 444} 445``` 446 447However, in the function below the parameter `c` is not an output parameter 448because its field `name` is not overwritten on every path through the function. 449 450```c++ 451void GetCustomer(Customer *c) { 452 c->account_id = ...; 453 if (...) { 454 c->name = ...; 455 } 456} 457``` 458 459The code also cannot read the value of the parameter before overwriting it: 460 461```c++ 462void GetCustomer(Customer *c) { 463 use(c->account_id); 464 c->name = ...; 465 c->account_id = ...; 466} 467``` 468 469Functions that escape the pointer also block the refactoring: 470 471```c++ 472Customer* kGlobalCustomer; 473 474void GetCustomer(Customer *c) { 475 c->name = ...; 476 c->account_id = ...; 477 kGlobalCustomer = c; 478} 479``` 480 481To identify a candidate function for refactoring, we need to do the following: 482 483* Find a function with a non-const pointer or reference parameter. 484 485* Find the definition of that function. 486 487* Prove that the function completely overwrites the pointee on all paths 488 before returning. 489 490* Prove that the function reads the pointee only after overwriting it. 491 492* Prove that the function does not persist the pointer in a data structure 493 that is live after the function returns. 494 495There are also requirements that all usage sites of the candidate function must 496satisfy, for example, that function arguments do not alias, that users are not 497taking the address of the function, and so on. Let's consider verifying usage 498site conditions to be a separate static analysis problem. 499 500### Lattice design 501 502To analyze the function body we can use a lattice which consists of normal 503states and failure states. A normal state describes program points where we are 504sure that no behaviors that block the refactoring have occurred. Normal states 505keep track of all parameter's member fields that are known to be overwritten on 506every path from function entry to the corresponding program point. Failure 507states accumulate observed violations (unsafe reads and pointer escapes) that 508block the refactoring. 509 510In the partial order of the lattice failure states compare greater than normal 511states, which guarantees that they "win" when joined with normal states. Order 512between failure states is determined by inclusion relation on the set of 513accumulated violations (lattice's `⩽` is `⊆` on the set of violations). Order 514between normal states is determined by reversed inclusion relation on the set of 515overwritten parameter's member fields (lattice's `⩽` is `⊇` on the set of 516overwritten fields). 517 518 519 520To determine whether a statement reads or writes a field we can implement 521symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, pointer 522dereference operator and `MemberExpr`s. 523 524### Using data flow results to identify output parameters 525 526Let's take a look at how we use data flow analysis to identify an output 527parameter. The refactoring can be safely done when the data flow algorithm 528computes a normal state with all of the fields proven to be overwritten in the 529exit basic block of the function. 530 531```c++ 532struct Customer { 533 int account_id; 534 std::string name; 535}; 536 537void GetCustomer(Customer* c) { 538 // Overwritten: {} 539 c->account_id = ...; // Overwritten: {c->account_id} 540 if (...) { 541 c->name = ...; // Overwritten: {c->account_id, c->name} 542 } else { 543 c->name = ...; // Overwritten: {c->account_id, c->name} 544 } 545 // Overwritten: {c->account_id, c->name} 546} 547``` 548 549When the data flow algorithm computes a normal state, but not all fields are 550proven to be overwritten we can't perform the refactoring. 551 552```c++ 553void target(bool b, Customer* c) { 554 // Overwritten: {} 555 if (b) { 556 c->account_id = 42; // Overwritten: {c->account_id} 557 } else { 558 c->name = "Konrad"; // Overwritten: {c->name} 559 } 560 // Overwritten: {} 561} 562``` 563 564Similarly, when the data flow algorithm computes a failure state, we also can't 565perform the refactoring. 566 567```c++ 568Customer* kGlobalCustomer; 569 570void GetCustomer(Customer* c) { 571 // Overwritten: {} 572 c->account_id = ...; // Overwritten: {c->account_id} 573 if (...) { 574 print(c->name); // Unsafe read 575 } else { 576 kGlobalCustomer = c; // Pointer escape 577 } 578 // Unsafe read, Pointer escape 579} 580``` 581 582## Example: finding dead stores 583 584Let's say we want to find redundant stores, because they indicate potential 585bugs. 586 587```c++ 588x = GetX(); 589x = GetY(); 590``` 591 592The first store to `x` is never read, probably there is a bug. 593 594The implementation of dead store analysis is very similar to output parameter 595analysis: we need to track stores and loads, and find stores that were never 596read. 597 598[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a 599generalization of this idea, which is often used to answer many related 600questions, for example: 601 602* finding dead stores, 603* finding uninitialized variables, 604* finding a good point to deallocate memory, 605* finding out if it would be safe to move an object. 606 607## Example: definitive initialization 608 609Definitive initialization proves that variables are known to be initialized when 610read. If we find a variable which is read when not initialized then we generate 611a warning. 612 613```c++ 614void Init() { 615 int x; // x is uninitialized 616 if (cond()) { 617 x = 10; // x is initialized 618 } else { 619 x = 20; // x is initialized 620 } 621 print(x); // x is initialized 622} 623``` 624 625```c++ 626void Uninit() { 627 int x; // x is uninitialized 628 if (cond()) { 629 x = 10; // x is initialized 630 } 631 print(x); // x is maybe uninitialized, x is being read, report a bug. 632} 633``` 634 635For this purpose we can use lattice in a form of a mapping from variable 636declarations to initialization states; each initialization state is represented 637by the following lattice: 638 639 640 641A lattice element could also capture the source locations of the branches that 642lead us to the corresponding program point. Diagnostics would use this 643information to show a sample buggy code path to the user. 644 645## Example: refactoring raw pointers to `unique_ptr` 646 647Modern idiomatic C++ uses smart pointers to express memory ownership, however in 648pre-C++11 code one can often find raw pointers that own heap memory blocks. 649 650Imagine that we would like to refactor raw pointers that own memory to 651`unique_ptr`. There are multiple ways to design a data flow analysis for this 652problem; let's look at one way to do it. 653 654For example, we would like to refactor the following code that uses raw 655pointers: 656 657```c++ 658void UniqueOwnership1() { 659 int *pi = new int; 660 if (...) { 661 Borrow(pi); 662 delete pi; 663 } else { 664 TakeOwnership(pi); 665 } 666} 667``` 668 669into code that uses `unique_ptr`: 670 671```c++ 672void UniqueOwnership1() { 673 auto pi = std::make_unique<int>(); 674 if (...) { 675 Borrow(pi.get()); 676 } else { 677 TakeOwnership(pi.release()); 678 } 679} 680``` 681 682This problem can be solved with a lattice in form of map from value declarations 683to pointer states: 684 685 686 687We can perform the refactoring if at the exit of a function `pi` is 688`Compatible`. 689 690```c++ 691void UniqueOwnership1() { 692 int *pi; // pi is Compatible 693 pi = new int; // pi is Defined 694 if (...) { 695 Borrow(pi); // pi is Defined 696 delete pi; // pi is Compatible 697 } else { 698 TakeOwnership(pi); // pi is Compatible 699 } 700 // pi is Compatible 701} 702``` 703 704Let's look at an example where the raw pointer owns two different memory blocks: 705 706```c++ 707void UniqueOwnership2() { 708 int *pi = new int; // pi is Defined 709 Borrow(pi); 710 delete pi; // pi is Compatible 711 if (smth) { 712 pi = new int; // pi is Defined 713 Borrow(pi); 714 delete pi; // pi is Compatible 715 } 716 // pi is Compatible 717} 718``` 719 720It can be refactored to use `unique_ptr` like this: 721 722```c++ 723void UniqueOwnership2() { 724 auto pi = make_unique<int>(); 725 Borrow(pi); 726 if (smth) { 727 pi = make_unique<int>(); 728 Borrow(pi); 729 } 730} 731``` 732 733In the following example, the raw pointer is used to access the heap object 734after the ownership has been transferred. 735 736```c++ 737void UniqueOwnership3() { 738 int *pi = new int; // pi is Defined 739 if (...) { 740 Borrow(pi); 741 delete pi; // pi is Compatible 742 } else { 743 vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible 744 print(*pi); 745 use(v); 746 } 747 // pi is Compatible 748} 749``` 750 751We can refactor this code to use `unique_ptr`, however we would have to 752introduce a non-owning pointer variable, since we can't use the moved-from 753`unique_ptr` to access the object: 754 755```c++ 756void UniqueOwnership3() { 757 std::unique_ptr<int> pi = std::make_unique<int>(); 758 if (...) { 759 Borrow(pi); 760 } else { 761 int *pi_non_owning = pi.get(); 762 vector<unique_ptr<int>> v = {std::move(pi)}; 763 print(*pi_non_owning); 764 use(v); 765 } 766} 767``` 768 769If the original code didn't call `delete` at the very end of the function, then 770our refactoring may change the point at which we run the destructor and release 771memory. Specifically, if there is some user code after `delete`, then extending 772the lifetime of the object until the end of the function may hold locks for 773longer than necessary, introduce memory overhead etc. 774 775One solution is to always replace `delete` with a call to `reset()`, and then 776perform another analysis that removes unnecessary `reset()` calls. 777 778```c++ 779void AddedMemoryOverhead() { 780 HugeObject *ho = new HugeObject(); 781 use(ho); 782 delete ho; // Release the large amount of memory quickly. 783 LongRunningFunction(); 784} 785``` 786 787This analysis will refuse to refactor code that mixes borrowed pointer values 788and unique ownership. In the following code, `GetPtr()` returns a borrowed 789pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned 790pointer. We don't distinguish between these two assignments, and we want each 791assignment to be paired with a corresponding sink; otherwise, we transition the 792pointer to a `Conflicting` state, like in this example. 793 794```c++ 795void ConflictingOwnership() { 796 int *pi; // pi is Compatible 797 pi = GetPtr(); // pi is Defined 798 Borrow(pi); // pi is Defined 799 800 pi = new int; // pi is Conflicting 801 Borrow(pi); 802 delete pi; 803 // pi is Conflicting 804} 805``` 806 807We could still handle this case by finding a maximal range in the code where 808`pi` could be in the Compatible state, and only refactoring that part. 809 810```c++ 811void ConflictingOwnership() { 812 int *pi; 813 pi = GetPtr(); 814 Borrow(pi); 815 816 std::unique_ptr<int> pi_unique = std::make_unique<int>(); 817 Borrow(pi_unique.get()); 818} 819``` 820 821## Example: finding redundant branch conditions 822 823In the code below `b1` should not be checked in both the outer and inner "if" 824statements. It is likely there is a bug in this code. 825 826```c++ 827int F(bool b1, bool b2) { 828 if (b1) { 829 f(); 830 if (b1 && b2) { // Check `b1` again -- unnecessary! 831 g(); 832 } 833 } 834} 835``` 836 837A checker that finds this pattern syntactically is already implemented in 838ClangTidy using AST matchers (`bugprone-redundant-branch-condition`). 839 840To implement it using the data flow analysis framework, we can produce a warning 841if any part of the branch condition is implied by the flow condition. 842 843```c++ 844int F(bool b1, bool b2) { 845 // Flow condition: true. 846 if (b1) { 847 // Flow condition: b1. 848 f(); 849 if (b1 && b2) { // `b1` is implied by the flow condition. 850 g(); 851 } 852 } 853} 854``` 855 856One way to check this implication is to use a SAT solver. Without a SAT solver, 857we could keep the flow condition in the CNF form and then it would be easy to 858check the implication. 859 860## Example: finding unchecked `std::optional` unwraps 861 862Calling `optional::value()` is only valid if `optional::has_value()` is true. We 863want to show that when `x.value()` is executed, the flow condition implies 864`x.has_value()`. 865 866In the example below `x.value()` is accessed safely because it is guarded by the 867`x.has_value()` check. 868 869```c++ 870void Example(std::optional<int> &x) { 871 if (x.has_value()) { 872 use(x.value()); 873 } 874} 875``` 876 877While entering the if branch we deduce that `x.has_value()` is implied by the 878flow condition. 879 880```c++ 881void Example(std::optional<int> x) { 882 // Flow condition: true. 883 if (x.has_value()) { 884 // Flow condition: x.has_value() == true. 885 use(x.value()); 886 } 887 // Flow condition: true. 888} 889``` 890 891We also need to prove that `x` is not modified between check and value access. 892The modification of `x` may be very subtle: 893 894```c++ 895void F(std::optional<int> &x); 896 897void Example(std::optional<int> &x) { 898 if (x.has_value()) { 899 // Flow condition: x.has_value() == true. 900 unknown_function(x); // may change x. 901 // Flow condition: true. 902 use(x.value()); 903 } 904} 905``` 906 907## Example: finding dead code behind A/B experiment flags 908 909Finding dead code is a classic application of data flow analysis. 910 911Unused flags for A/B experiment hide dead code. However, this flavor of dead 912code is invisible to the compiler because the flag can be turned on at any 913moment. 914 915We could make a tool that deletes experiment flags. The user tells us which flag 916they want to delete, and we assume that the it's value is a given constant. 917 918For example, the user could use the tool to remove `example_flag` from this 919code: 920 921```c++ 922DEFINE_FLAG(std::string, example_flag, "", "A sample flag."); 923 924void Example() { 925 bool x = GetFlag(FLAGS_example_flag).empty(); 926 f(); 927 if (x) { 928 g(); 929 } else { 930 h(); 931 } 932} 933``` 934 935The tool would simplify the code to: 936 937```c++ 938void Example() { 939 f(); 940 g(); 941} 942``` 943 944We can solve this problem with a classic constant propagation lattice combined 945with symbolic evaluation. 946 947## Example: finding inefficient usages of associative containers 948 949Real-world code often accidentally performs repeated lookups in associative 950containers: 951 952```c++ 953map<int, Employee> xs; 954xs[42]->name = "..."; 955xs[42]->title = "..."; 956``` 957 958To find the above inefficiency we can use the available expressions analysis to 959understand that `m[42]` is evaluated twice. 960 961```c++ 962map<int, Employee> xs; 963Employee &e = xs[42]; 964e->name = "..."; 965e->title = "..."; 966``` 967 968We can also track the `m.contains()` check in the flow condition to find 969redundant checks, like in the example below. 970 971```c++ 972std::map<int, Employee> xs; 973if (!xs.contains(42)) { 974 xs.insert({42, someEmployee}); 975} 976``` 977 978## Example: refactoring types that implicitly convert to each other 979 980Refactoring one strong type to another is difficult, but the compiler can help: 981once you refactor one reference to the type, the compiler will flag other places 982where this information flows with type mismatch errors. Unfortunately this 983strategy does not work when you are refactoring types that implicitly convert to 984each other, for example, replacing `int32_t` with `int64_t`. 985 986Imagine that we want to change user IDs from 32 to 64-bit integers. In other 987words, we need to find all integers tainted with user IDs. We can use data flow 988analysis to implement taint analysis. 989 990```c++ 991void UseUser(int32_t user_id) { 992 int32_t id = user_id; 993 // Variable `id` is tainted with a user ID. 994 ... 995} 996``` 997 998Taint analysis is very well suited to this problem because the program rarely 999branches on user IDs, and almost certainly does not perform any computation 1000(like arithmetic). 1001