xref: /llvm-project/clang/docs/analyzer/developer-docs/RegionStore.rst (revision 1a17032b788016299ea4e3c4b53670c6dcd94b4f)
1*1a17032bSKristof Umann============
2*1a17032bSKristof UmannRegion Store
3*1a17032bSKristof Umann============
4*1a17032bSKristof UmannThe analyzer "Store" represents the contents of memory regions. It is an opaque
5*1a17032bSKristof Umannfunctional data structure stored in each ``ProgramState``; the only class that
6*1a17032bSKristof Umanncan modify the store is its associated StoreManager.
7*1a17032bSKristof Umann
8*1a17032bSKristof UmannCurrently (Feb. 2013), the only StoreManager implementation being used is
9*1a17032bSKristof Umann``RegionStoreManager``. This store records bindings to memory regions using a
10*1a17032bSKristof Umann"base region + offset" key. (This allows ``*p`` and ``p[0]`` to map to the same
11*1a17032bSKristof Umannlocation, among other benefits.)
12*1a17032bSKristof Umann
13*1a17032bSKristof UmannRegions are grouped into "clusters", which roughly correspond to "regions with
14*1a17032bSKristof Umannthe same base region". This allows certain operations to be more efficient,
15*1a17032bSKristof Umannsuch as invalidation.
16*1a17032bSKristof Umann
17*1a17032bSKristof UmannRegions that do not have a known offset use a special "symbolic" offset. These
18*1a17032bSKristof Umannkeys store both the original region, and the "concrete offset region" -- the
19*1a17032bSKristof Umannlast region whose offset is entirely concrete. (For example, in the expression
20*1a17032bSKristof Umann``foo.bar[1][i].baz``, the concrete offset region is the array ``foo.bar[1]``,
21*1a17032bSKristof Umannsince that has a known offset from the start of the top-level ``foo`` struct.)
22*1a17032bSKristof Umann
23*1a17032bSKristof Umann
24*1a17032bSKristof UmannBinding Invalidation
25*1a17032bSKristof Umann--------------------
26*1a17032bSKristof Umann
27*1a17032bSKristof UmannSupporting both concrete and symbolic offsets makes things a bit tricky. Here's
28*1a17032bSKristof Umannan example:
29*1a17032bSKristof Umann
30*1a17032bSKristof Umann.. code-block:: cpp
31*1a17032bSKristof Umann
32*1a17032bSKristof Umann  foo[0] = 0;
33*1a17032bSKristof Umann  foo[1] = 1;
34*1a17032bSKristof Umann  foo[i] = i;
35*1a17032bSKristof Umann
36*1a17032bSKristof UmannAfter the third assignment, nothing can be said about the value of ``foo[0]``,
37*1a17032bSKristof Umannbecause ``foo[i]`` may have overwritten it! Thus, *binding to a region with a
38*1a17032bSKristof Umannsymbolic offset invalidates the entire concrete offset region.* We know
39*1a17032bSKristof Umann``foo[i]`` is somewhere within ``foo``, so we don't have to invalidate
40*1a17032bSKristof Umannanything else, but we do have to be conservative about all other bindings within
41*1a17032bSKristof Umann``foo``.
42*1a17032bSKristof Umann
43*1a17032bSKristof UmannContinuing the example:
44*1a17032bSKristof Umann
45*1a17032bSKristof Umann.. code-block:: cpp
46*1a17032bSKristof Umann
47*1a17032bSKristof Umann  foo[i] = i;
48*1a17032bSKristof Umann  foo[0] = 0;
49*1a17032bSKristof Umann
50*1a17032bSKristof UmannAfter this latest assignment, nothing can be said about the value of ``foo[i]``,
51*1a17032bSKristof Umannbecause ``foo[0]`` may have overwritten it! *Binding to a region R with a
52*1a17032bSKristof Umannconcrete offset invalidates any symbolic offset bindings whose concrete offset
53*1a17032bSKristof Umannregion is a super-region **or** sub-region of R.* All we know about ``foo[i]``
54*1a17032bSKristof Umannis that it is somewhere within ``foo``, so changing *anything* within ``foo``
55*1a17032bSKristof Umannmight change ``foo[i]``, and changing *all* of ``foo`` (or its base region) will
56*1a17032bSKristof Umann*definitely* change ``foo[i]``.
57*1a17032bSKristof Umann
58*1a17032bSKristof UmannThis logic could be improved by using the current constraints on ``i``, at the
59*1a17032bSKristof Umanncost of speed. The latter case could also be improved by matching region kinds,
60*1a17032bSKristof Umanni.e. changing ``foo[0].a`` is unlikely to affect ``foo[i].b``, no matter what
61*1a17032bSKristof Umann``i`` is.
62*1a17032bSKristof Umann
63*1a17032bSKristof UmannFor more detail, read through ``RegionStoreManager::removeSubRegionBindings`` in
64*1a17032bSKristof UmannRegionStore.cpp.
65*1a17032bSKristof Umann
66*1a17032bSKristof Umann
67*1a17032bSKristof UmannObjCIvarRegions
68*1a17032bSKristof Umann---------------
69*1a17032bSKristof Umann
70*1a17032bSKristof UmannObjective-C instance variables require a bit of special handling. Like struct
71*1a17032bSKristof Umannfields, they are not base regions, and when their parent object region is
72*1a17032bSKristof Umanninvalidated, all the instance variables must be invalidated as well. However,
73*1a17032bSKristof Umannthey have no concrete compile-time offsets (in the modern, "non-fragile"
74*1a17032bSKristof Umannruntime), and so cannot easily be represented as an offset from the start of
75*1a17032bSKristof Umannthe object in the analyzer. Moreover, this means that invalidating a single
76*1a17032bSKristof Umanninstance variable should *not* invalidate the rest of the object, since unlike
77*1a17032bSKristof Umannstruct fields or array elements there is no way to perform pointer arithmetic
78*1a17032bSKristof Umannto access another instance variable.
79*1a17032bSKristof Umann
80*1a17032bSKristof UmannConsequently, although the base region of an ObjCIvarRegion is the entire
81*1a17032bSKristof Umannobject, RegionStore offsets are computed from the start of the instance
82*1a17032bSKristof Umannvariable. Thus it is not valid to assume that all bindings with non-symbolic
83*1a17032bSKristof Umannoffsets start from the base region!
84*1a17032bSKristof Umann
85*1a17032bSKristof Umann
86*1a17032bSKristof UmannRegion Invalidation
87*1a17032bSKristof Umann-------------------
88*1a17032bSKristof Umann
89*1a17032bSKristof UmannUnlike binding invalidation, region invalidation occurs when the entire
90*1a17032bSKristof Umanncontents of a region may have changed---say, because it has been passed to a
91*1a17032bSKristof Umannfunction the analyzer can model, like memcpy, or because its address has
92*1a17032bSKristof Umannescaped, usually as an argument to an opaque function call. In these cases we
93*1a17032bSKristof Umannneed to throw away not just all bindings within the region itself, but within
94*1a17032bSKristof Umannits entire cluster, since neighboring regions may be accessed via pointer
95*1a17032bSKristof Umannarithmetic.
96*1a17032bSKristof Umann
97*1a17032bSKristof UmannRegion invalidation typically does even more than this, however. Because it
98*1a17032bSKristof Umannusually represents the complete escape of a region from the analyzer's model,
99*1a17032bSKristof Umannits *contents* must also be transitively invalidated. (For example, if a region
100*1a17032bSKristof Umann``p`` of type ``int **`` is invalidated, the contents of ``*p`` and ``**p`` may
101*1a17032bSKristof Umannhave changed as well.) The algorithm that traverses this transitive closure of
102*1a17032bSKristof Umannaccessible regions is known as ClusterAnalysis, and is also used for finding
103*1a17032bSKristof Umannall live bindings in the store (in order to throw away the dead ones). The name
104*1a17032bSKristof Umann"ClusterAnalysis" predates the cluster-based organization of bindings, but
105*1a17032bSKristof Umannrefers to the same concept: during invalidation and liveness analysis, all
106*1a17032bSKristof Umannbindings within a cluster must be treated in the same way for a conservative
107*1a17032bSKristof Umannmodel of program behavior.
108*1a17032bSKristof Umann
109*1a17032bSKristof Umann
110*1a17032bSKristof UmannDefault Bindings
111*1a17032bSKristof Umann----------------
112*1a17032bSKristof Umann
113*1a17032bSKristof UmannMost bindings in RegionStore are simple scalar values -- integers and pointers.
114*1a17032bSKristof UmannThese are known as "Direct" bindings. However, RegionStore supports a second
115*1a17032bSKristof Umanntype of binding called a "Default" binding. These are used to provide values to
116*1a17032bSKristof Umannall the elements of an aggregate type (struct or array) without having to
117*1a17032bSKristof Umannexplicitly specify a binding for each individual element.
118*1a17032bSKristof Umann
119*1a17032bSKristof UmannWhen there is no Direct binding for a particular region, the store manager
120*1a17032bSKristof Umannlooks at each super-region in turn to see if there is a Default binding. If so,
121*1a17032bSKristof Umannthis value is used as the value of the original region. The search ends when
122*1a17032bSKristof Umannthe base region is reached, at which point the RegionStore will pick an
123*1a17032bSKristof Umannappropriate default value for the region (usually a symbolic value, but
124*1a17032bSKristof Umannsometimes zero, for static data, or "uninitialized", for stack variables).
125*1a17032bSKristof Umann
126*1a17032bSKristof Umann.. code-block:: cpp
127*1a17032bSKristof Umann
128*1a17032bSKristof Umann  int manyInts[10];
129*1a17032bSKristof Umann  manyInts[1] = 42;   // Creates a Direct binding for manyInts[1].
130*1a17032bSKristof Umann  print(manyInts[1]); // Retrieves the Direct binding for manyInts[1];
131*1a17032bSKristof Umann  print(manyInts[0]); // There is no Direct binding for manyInts[0].
132*1a17032bSKristof Umann                      // Is there a Default binding for the entire array?
133*1a17032bSKristof Umann                      // There is not, but it is a stack variable, so we use
134*1a17032bSKristof Umann                      // "uninitialized" as the default value (and emit a
135*1a17032bSKristof Umann                      // diagnostic!).
136*1a17032bSKristof Umann
137*1a17032bSKristof UmannNOTE: The fact that bindings are stored as a base region plus an offset limits
138*1a17032bSKristof Umannthe Default Binding strategy, because in C aggregates can contain other
139*1a17032bSKristof Umannaggregates. In the current implementation of RegionStore, there is no way to
140*1a17032bSKristof Umanndistinguish a Default binding for an entire aggregate from a Default binding
141*1a17032bSKristof Umannfor the sub-aggregate at offset 0.
142*1a17032bSKristof Umann
143*1a17032bSKristof Umann
144*1a17032bSKristof UmannLazy Bindings (LazyCompoundVal)
145*1a17032bSKristof Umann-------------------------------
146*1a17032bSKristof Umann
147*1a17032bSKristof UmannRegionStore implements an optimization for copying aggregates (structs and
148*1a17032bSKristof Umannarrays) called "lazy bindings", implemented using a special SVal called
149*1a17032bSKristof UmannLazyCompoundVal. When the store is asked for the "binding" for an entire
150*1a17032bSKristof Umannaggregate (i.e. for an lvalue-to-rvalue conversion), it returns a
151*1a17032bSKristof UmannLazyCompoundVal instead. When this value is then stored into a variable, it is
152*1a17032bSKristof Umannbound as a Default value. This makes copying arrays and structs much cheaper
153*1a17032bSKristof Umannthan if they had required memberwise access.
154*1a17032bSKristof Umann
155*1a17032bSKristof UmannUnder the hood, a LazyCompoundVal is implemented as a uniqued pair of (region,
156*1a17032bSKristof Umannstore), representing "the value of the region during this 'snapshot' of the
157*1a17032bSKristof Umannstore". This has important implications for any sort of liveness or
158*1a17032bSKristof Umannreachability analysis, which must take the bindings in the old store into
159*1a17032bSKristof Umannaccount.
160*1a17032bSKristof Umann
161*1a17032bSKristof UmannRetrieving a value from a lazy binding happens in the same way as any other
162*1a17032bSKristof UmannDefault binding: since there is no direct binding, the store manager falls back
163*1a17032bSKristof Umannto super-regions to look for an appropriate default binding. LazyCompoundVal
164*1a17032bSKristof Umanndiffers from a normal default binding, however, in that it contains several
165*1a17032bSKristof Umanndifferent values, instead of one value that will appear several times. Because
166*1a17032bSKristof Umannof this, the store manager has to reconstruct the subregion chain on top of the
167*1a17032bSKristof UmannLazyCompoundVal region, and look up *that* region in the previous store.
168*1a17032bSKristof Umann
169*1a17032bSKristof UmannHere's a concrete example:
170*1a17032bSKristof Umann
171*1a17032bSKristof Umann.. code-block:: cpp
172*1a17032bSKristof Umann
173*1a17032bSKristof Umann  CGPoint p;
174*1a17032bSKristof Umann  p.x = 42;       // A Direct binding is made to the FieldRegion 'p.x'.
175*1a17032bSKristof Umann  CGPoint p2 = p; // A LazyCompoundVal is created for 'p', along with a
176*1a17032bSKristof Umann                  // snapshot of the current store state. This value is then
177*1a17032bSKristof Umann                  // used as a Default binding for the VarRegion 'p2'.
178*1a17032bSKristof Umann  return p2.x;    // The binding for FieldRegion 'p2.x' is requested.
179*1a17032bSKristof Umann                  // There is no Direct binding, so we look for a Default
180*1a17032bSKristof Umann                  // binding to 'p2' and find the LCV.
181*1a17032bSKristof Umann                  // Because it's a LCV, we look at our requested region
182*1a17032bSKristof Umann                  // and see that it's the '.x' field. We ask for the value
183*1a17032bSKristof Umann                  // of 'p.x' within the snapshot, and get back 42.
184