1*e5dd7070Spatrick================ 2*e5dd7070SpatrickInitializer List 3*e5dd7070Spatrick================ 4*e5dd7070SpatrickThis discussion took place in https://reviews.llvm.org/D35216 5*e5dd7070Spatrick"Escape symbols when creating std::initializer_list". 6*e5dd7070Spatrick 7*e5dd7070SpatrickIt touches problems of modelling C++ standard library constructs in general, 8*e5dd7070Spatrickincluding modelling implementation-defined fields within C++ standard library 9*e5dd7070Spatrickobjects, in particular constructing objects into pointers held by such fields, 10*e5dd7070Spatrickand separation of responsibilities between analyzer's core and checkers. 11*e5dd7070Spatrick 12*e5dd7070Spatrick**Artem:** 13*e5dd7070Spatrick 14*e5dd7070SpatrickI've seen a few false positives that appear because we construct 15*e5dd7070SpatrickC++11 std::initializer_list objects with brace initializers, and such 16*e5dd7070Spatrickconstruction is not properly modeled. For instance, if a new object is 17*e5dd7070Spatrickconstructed on the heap only to be put into a brace-initialized STL container, 18*e5dd7070Spatrickthe object is reported to be leaked. 19*e5dd7070Spatrick 20*e5dd7070SpatrickApproach (0): This can be trivially fixed by this patch, which causes pointers 21*e5dd7070Spatrickpassed into initializer list expressions to immediately escape. 22*e5dd7070Spatrick 23*e5dd7070SpatrickThis fix is overly conservative though. So i did a bit of investigation as to 24*e5dd7070Spatrickhow model std::initializer_list better. 25*e5dd7070Spatrick 26*e5dd7070SpatrickAccording to the standard, ``std::initializer_list<T>`` is an object that has 27*e5dd7070Spatrickmethods ``begin(), end(), and size()``, where ``begin()`` returns a pointer to continuous 28*e5dd7070Spatrickarray of ``size()`` objects of type T, and end() is equal to begin() plus size(). 29*e5dd7070SpatrickThe standard does hint that it should be possible to implement 30*e5dd7070Spatrick``std::initializer_list<T>`` as a pair of pointers, or as a pointer and a size 31*e5dd7070Spatrickinteger, however specific fields that the object would contain are an 32*e5dd7070Spatrickimplementation detail. 33*e5dd7070Spatrick 34*e5dd7070SpatrickIdeally, we should be able to model the initializer list's methods precisely. 35*e5dd7070SpatrickOr, at least, it should be possible to explain to the analyzer that the list 36*e5dd7070Spatricksomehow "takes hold" of the values put into it. Initializer lists can also be 37*e5dd7070Spatrickcopied, which is a separate story that i'm not trying to address here. 38*e5dd7070Spatrick 39*e5dd7070SpatrickThe obvious approach to modeling ``std::initializer_list`` in a checker would be to 40*e5dd7070Spatrickconstruct a SymbolMetadata for the memory region of the initializer list object, 41*e5dd7070Spatrickwhich would be of type ``T*`` and represent ``begin()``, so we'd trivially model ``begin()`` 42*e5dd7070Spatrickas a function that returns this symbol. The array pointed to by that symbol 43*e5dd7070Spatrickwould be ``bindLoc()``ed to contain the list's contents (probably as a ``CompoundVal`` 44*e5dd7070Spatrickto produce less bindings in the store). Extent of this array would represent 45*e5dd7070Spatrick``size()`` and would be equal to the length of the list as written. 46*e5dd7070Spatrick 47*e5dd7070SpatrickSo this sounds good, however apparently it does nothing to address our false 48*e5dd7070Spatrickpositives: when the list escapes, our ``RegionStoreManager`` is not magically 49*e5dd7070Spatrickguessing that the metadata symbol attached to it, together with its contents, 50*e5dd7070Spatrickshould also escape. In fact, it's impossible to trigger a pointer escape from 51*e5dd7070Spatrickwithin the checker. 52*e5dd7070Spatrick 53*e5dd7070SpatrickApproach (1): If only we enabled ``ProgramState::bindLoc(..., notifyChanges=true)`` 54*e5dd7070Spatrickto cause pointer escapes (not only region changes) (which sounds like the right 55*e5dd7070Spatrickthing to do anyway) such checker would be able to solve the false positives by 56*e5dd7070Spatricktriggering escapes when binding list elements to the list. However, it'd be as 57*e5dd7070Spatrickconservative as the current patch's solution. Ideally, we do not want escapes to 58*e5dd7070Spatrickhappen so early. Instead, we'd prefer them to be delayed until the list itself 59*e5dd7070Spatrickescapes. 60*e5dd7070Spatrick 61*e5dd7070SpatrickSo i believe that escaping metadata symbols whenever their base regions escape 62*e5dd7070Spatrickwould be the right thing to do. Currently we didn't think about that because we 63*e5dd7070Spatrickhad neither pointer-type metadatas nor non-pointer escapes. 64*e5dd7070Spatrick 65*e5dd7070SpatrickApproach (2): We could teach the Store to scan itself for bindings to 66*e5dd7070Spatrickmetadata-symbolic-based regions during scanReachableSymbols() whenever a region 67*e5dd7070Spatrickturns out to be reachable. This requires no work on checker side, but it sounds 68*e5dd7070Spatrickperformance-heavy. 69*e5dd7070Spatrick 70*e5dd7070SpatrickApproach (3): We could let checkers maintain the set of active metadata symbols 71*e5dd7070Spatrickin the program state (ideally somewhere in the Store, which sounds weird but 72*e5dd7070Spatrickcauses the smallest amount of layering violations), so that the core knew what 73*e5dd7070Spatrickto escape. This puts a stress on the checkers, but with a smart data map it 74*e5dd7070Spatrickwouldn't be a problem. 75*e5dd7070Spatrick 76*e5dd7070SpatrickApproach (4): We could allow checkers to trigger pointer escapes in arbitrary 77*e5dd7070Spatrickmoments. If we allow doing this within ``checkPointerEscape`` callback itself, we 78*e5dd7070Spatrickwould be able to express facts like "when this region escapes, that metadata 79*e5dd7070Spatricksymbol attached to it should also escape". This sounds like an ultimate freedom, 80*e5dd7070Spatrickwith maximum stress on the checkers - still not too much stress when we have 81*e5dd7070Spatricksmart data maps. 82*e5dd7070Spatrick 83*e5dd7070SpatrickI'm personally liking the approach (2) - it should be possible to avoid 84*e5dd7070Spatrickperformance overhead, and clarity seems nice. 85*e5dd7070Spatrick 86*e5dd7070Spatrick**Gabor:** 87*e5dd7070Spatrick 88*e5dd7070SpatrickAt this point, I am a bit wondering about two questions. 89*e5dd7070Spatrick 90*e5dd7070Spatrick* When should something belong to a checker and when should something belong to the engine? 91*e5dd7070Spatrick Sometimes we model library aspects in the engine and model language constructs in checkers. 92*e5dd7070Spatrick 93*e5dd7070Spatrick* What is the checker programming model that we are aiming for? Maximum freedom or more easy checker development? 94*e5dd7070Spatrick 95*e5dd7070SpatrickI think if we aim for maximum freedom, we do not need to worry about the 96*e5dd7070Spatrickpotential stress on checkers, and we can introduce abstractions to mitigate that 97*e5dd7070Spatricklater on. 98*e5dd7070SpatrickIf we want to simplify the API, then maybe it makes more sense to move language 99*e5dd7070Spatrickconstruct modeling to the engine when the checker API is not sufficient instead 100*e5dd7070Spatrickof complicating the API. 101*e5dd7070Spatrick 102*e5dd7070SpatrickRight now I have no preference or objections between the alternatives but there 103*e5dd7070Spatrickare some random thoughts: 104*e5dd7070Spatrick 105*e5dd7070Spatrick* Maybe it would be great to have a guideline how to evolve the analyzer and 106*e5dd7070Spatrick follow it, so it can help us to decide in similar situations 107*e5dd7070Spatrick 108*e5dd7070Spatrick* I do care about performance in this case. The reason is that we have a 109*e5dd7070Spatrick limited performance budget. And I think we should not expect most of the checker 110*e5dd7070Spatrick writers to add modeling of language constructs. So, in my opinion, it is ok to 111*e5dd7070Spatrick have less nice/more verbose API for language modeling if we can have better 112*e5dd7070Spatrick performance this way, since it only needs to be done once, and is done by the 113*e5dd7070Spatrick framework developers. 114*e5dd7070Spatrick 115*e5dd7070Spatrick**Artem:** These are some great questions, i guess it'd be better to discuss 116*e5dd7070Spatrickthem more openly. As a quick dump of my current mood: 117*e5dd7070Spatrick 118*e5dd7070Spatrick* To me it seems obvious that we need to aim for a checker API that is both 119*e5dd7070Spatrick simple and powerful. This can probably by keeping the API as powerful as 120*e5dd7070Spatrick necessary while providing a layer of simple ready-made solutions on top of it. 121*e5dd7070Spatrick Probably a few reusable components for assembling checkers. And this layer 122*e5dd7070Spatrick should ideally be pleasant enough to work with, so that people would prefer to 123*e5dd7070Spatrick extend it when something is lacking, instead of falling back to the complex 124*e5dd7070Spatrick omnipotent API. I'm thinking of AST matchers vs. AST visitors as a roughly 125*e5dd7070Spatrick similar situation: matchers are not omnipotent, but they're so nice. 126*e5dd7070Spatrick 127*e5dd7070Spatrick* Separation between core and checkers is usually quite strange. Once we have 128*e5dd7070Spatrick shared state traits, i generally wouldn't mind having region store or range 129*e5dd7070Spatrick constraint manager as checkers (though it's probably not worth it to transform 130*e5dd7070Spatrick them - just a mood). The main thing to avoid here would be the situation when 131*e5dd7070Spatrick the checker overwrites stuff written by the core because it thinks it has a 132*e5dd7070Spatrick better idea what's going on, so the core should provide a good default behavior. 133*e5dd7070Spatrick 134*e5dd7070Spatrick* Yeah, i totally care about performance as well, and if i try to implement 135*e5dd7070Spatrick approach, i'd make sure it's good. 136*e5dd7070Spatrick 137*e5dd7070Spatrick**Artem:** 138*e5dd7070Spatrick 139*e5dd7070Spatrick> Approach (2): We could teach the Store to scan itself for bindings to 140*e5dd7070Spatrick> metadata-symbolic-based regions during scanReachableSymbols() whenever 141*e5dd7070Spatrick> a region turns out to be reachable. This requires no work on checker side, 142*e5dd7070Spatrick> but it sounds performance-heavy. 143*e5dd7070Spatrick 144*e5dd7070SpatrickNope, this approach is wrong. Metadata symbols may become out-of-date: when the 145*e5dd7070Spatrickobject changes, metadata symbols attached to it aren't changing (because symbols 146*e5dd7070Spatricksimply don't change). The same metadata may have different symbols to denote its 147*e5dd7070Spatrickvalue in different moments of time, but at most one of them represents the 148*e5dd7070Spatrickactual metadata value. So we'd be escaping more stuff than necessary. 149*e5dd7070Spatrick 150*e5dd7070SpatrickIf only we had "ghost fields" 151*e5dd7070Spatrick(https://lists.llvm.org/pipermail/cfe-dev/2016-May/049000.html), it would have 152*e5dd7070Spatrickbeen much easier, because the ghost field would only contain the actual 153*e5dd7070Spatrickmetadata, and the Store would always know about it. This example adds to my 154*e5dd7070Spatrickbelief that ghost fields are exactly what we need for most C++ checkers. 155*e5dd7070Spatrick 156*e5dd7070Spatrick**Devin:** 157*e5dd7070Spatrick 158*e5dd7070SpatrickIn this case, I would be fine with some sort of 159*e5dd7070SpatrickAbstractStorageMemoryRegion that meant "here is a memory region and somewhere 160*e5dd7070Spatrickreachable from here exists another region of type T". Or even multiple regions 161*e5dd7070Spatrickwith different identifiers. This wouldn't specify how the memory is reachable, 162*e5dd7070Spatrickbut it would allow for transfer functions to get at those regions and it would 163*e5dd7070Spatrickallow for invalidation. 164*e5dd7070Spatrick 165*e5dd7070SpatrickFor ``std::initializer_list`` this reachable region would the region for the backing 166*e5dd7070Spatrickarray and the transfer functions for begin() and end() yield the beginning and 167*e5dd7070Spatrickend element regions for it. 168*e5dd7070Spatrick 169*e5dd7070SpatrickIn my view this differs from ghost variables in that (1) this storage does 170*e5dd7070Spatrickactually exist (it is just a library implementation detail where that storage 171*e5dd7070Spatricklives) and (2) it is perfectly valid for a pointer into that storage to be 172*e5dd7070Spatrickreturned and for another part of the program to read or write from that storage. 173*e5dd7070Spatrick(Well, in this case just read since it is allowed to be read-only memory). 174*e5dd7070Spatrick 175*e5dd7070SpatrickWhat I'm not OK with is modeling abstract analysis state (for example, the count 176*e5dd7070Spatrickof a NSMutableArray or the typestate of a file handle) as a value stored in some 177*e5dd7070Spatrickginned up region in the store. This takes an easy problem that the analyzer does 178*e5dd7070Spatrickwell at (modeling typestate) and turns it into a hard one that the analyzer is 179*e5dd7070Spatrickbad at (reasoning about the contents of the heap). 180*e5dd7070Spatrick 181*e5dd7070SpatrickI think the key criterion here is: "is the region accessible from outside the 182*e5dd7070Spatricklibrary". That is, does the library expose the region as a pointer that can be 183*e5dd7070Spatrickread to or written from in the client program? If so, then it makes sense for 184*e5dd7070Spatrickthis to be in the store: we are modeling reachable storage as storage. But if 185*e5dd7070Spatrickwe're just modeling arbitrary analysis facts that need to be invalidated when a 186*e5dd7070Spatrickpointer escapes then we shouldn't try to gin up storage for them just to get 187*e5dd7070Spatrickinvalidation for free. 188*e5dd7070Spatrick 189*e5dd7070Spatrick**Artem:** 190*e5dd7070Spatrick 191*e5dd7070Spatrick> In this case, I would be fine with some sort of ``AbstractStorageMemoryRegion`` 192*e5dd7070Spatrick> that meant "here is a memory region and somewhere reachable from here exists 193*e5dd7070Spatrick> another region of type T". Or even multiple regions with different 194*e5dd7070Spatrick> identifiers. This wouldn't specify how the memory is reachable, but it would 195*e5dd7070Spatrick> allow for transfer functions to get at those regions and it would allow for 196*e5dd7070Spatrick> invalidation. 197*e5dd7070Spatrick 198*e5dd7070SpatrickYeah, this is what we can easily implement now as a 199*e5dd7070Spatricksymbolic-region-based-on-a-metadata-symbol (though we can make a new region 200*e5dd7070Spatrickclass for that if we eg. want it typed). The problem is that the relation 201*e5dd7070Spatrickbetween such storage region and its parent object region is essentially 202*e5dd7070Spatrickimmaterial, similarly to the relation between ``SymbolRegionValue`` and its parent 203*e5dd7070Spatrickregion. Region contents are mutable: today the abstract storage is reachable 204*e5dd7070Spatrickfrom its parent object, tomorrow it's not, and maybe something else becomes 205*e5dd7070Spatrickreachable, something that isn't even abstract. So the parent region for the 206*e5dd7070Spatrickabstract storage is most of the time at best a "nice to know" thing - we cannot 207*e5dd7070Spatrickrely on it to do any actual work. We'd anyway need to rely on the checker to do 208*e5dd7070Spatrickthe job. 209*e5dd7070Spatrick 210*e5dd7070Spatrick> For std::initializer_list this reachable region would the region for the 211*e5dd7070Spatrick> backing array and the transfer functions for begin() and end() yield the 212*e5dd7070Spatrick> beginning and end element regions for it. 213*e5dd7070Spatrick 214*e5dd7070SpatrickSo maybe in fact for std::initializer_list it may work fine because you cannot 215*e5dd7070Spatrickchange the data after the object is constructed - so this region's contents are 216*e5dd7070Spatrickessentially immutable. For the future, i feel as if it is a dead end. 217*e5dd7070Spatrick 218*e5dd7070SpatrickI'd like to consider another funny example. Suppose we're trying to model 219*e5dd7070Spatrick 220*e5dd7070Spatrick.. code-block:: cpp 221*e5dd7070Spatrick 222*e5dd7070Spatrick std::unique_ptr. Consider:: 223*e5dd7070Spatrick 224*e5dd7070Spatrick void bar(const std::unique_ptr<int> &x); 225*e5dd7070Spatrick 226*e5dd7070Spatrick void foo(std::unique_ptr<int> &x) { 227*e5dd7070Spatrick int *a = x.get(); // (a, 0, direct): &AbstractStorageRegion 228*e5dd7070Spatrick *a = 1; // (AbstractStorageRegion, 0, direct): 1 S32b 229*e5dd7070Spatrick int *b = new int; 230*e5dd7070Spatrick *b = 2; // (SymRegion{conj_$0<int *>}, 0 ,direct): 2 S32b 231*e5dd7070Spatrick x.reset(b); // Checker map: x -> SymRegion{conj_$0<int *>} 232*e5dd7070Spatrick bar(x); // 'a' doesn't escape (the pointer was unique), 'b' does. 233*e5dd7070Spatrick clang_analyzer_eval(*a == 1); // Making this true is up to the checker. 234*e5dd7070Spatrick clang_analyzer_eval(*b == 2); // Making this unknown is up to the checker. 235*e5dd7070Spatrick } 236*e5dd7070Spatrick 237*e5dd7070SpatrickThe checker doesn't totally need to ensure that ``*a == 1`` passes - even though the 238*e5dd7070Spatrickpointer was unique, it could theoretically have ``.get()``-ed above and the code 239*e5dd7070Spatrickcould of course break the uniqueness invariant (though we'd probably want it). 240*e5dd7070SpatrickThe checker can say that "even if ``*a`` did escape, it was not because it was 241*e5dd7070Spatrickstuffed directly into bar()". 242*e5dd7070Spatrick 243*e5dd7070SpatrickThe checker's direct responsibility, however, is to solve the ``*b == 2`` thing 244*e5dd7070Spatrick(which is in fact the problem we're dealing with in this patch - escaping the 245*e5dd7070Spatrickstorage region of the object). 246*e5dd7070Spatrick 247*e5dd7070SpatrickSo we're talking about one more operation over the program state (scanning 248*e5dd7070Spatrickreachable symbols and regions) that cannot work without checker support. 249*e5dd7070Spatrick 250*e5dd7070SpatrickWe can probably add a new callback "checkReachableSymbols" to solve this. This 251*e5dd7070Spatrickis in fact also related to the dead symbols problem (we're scanning for live 252*e5dd7070Spatricksymbols in the store and in the checkers separately, but we need to do so 253*e5dd7070Spatricksimultaneously with a single worklist). Hmm, in fact this sounds like a good 254*e5dd7070Spatrickidea; we can replace checkLiveSymbols with checkReachableSymbols. 255*e5dd7070Spatrick 256*e5dd7070SpatrickOr we could just have ghost member variables, and no checker support required at 257*e5dd7070Spatrickall. For ghost member variables, the relation with their parent region (which 258*e5dd7070Spatrickwould be their superregion) is actually useful, the mutability of their contents 259*e5dd7070Spatrickis expressed naturally, and the store automagically sees reachable symbols, live 260*e5dd7070Spatricksymbols, escapes, invalidations, whatever. 261*e5dd7070Spatrick 262*e5dd7070Spatrick> In my view this differs from ghost variables in that (1) this storage does 263*e5dd7070Spatrick> actually exist (it is just a library implementation detail where that storage 264*e5dd7070Spatrick> lives) and (2) it is perfectly valid for a pointer into that storage to be 265*e5dd7070Spatrick> returned and for another part of the program to read or write from that 266*e5dd7070Spatrick> storage. (Well, in this case just read since it is allowed to be read-only 267*e5dd7070Spatrick> memory). 268*e5dd7070Spatrick 269*e5dd7070Spatrick> What I'm not OK with is modeling abstract analysis state (for example, the 270*e5dd7070Spatrick> count of a NSMutableArray or the typestate of a file handle) as a value stored 271*e5dd7070Spatrick> in some ginned up region in the store.This takes an easy problem that the 272*e5dd7070Spatrick> analyzer does well at (modeling typestate) and turns it into a hard one that 273*e5dd7070Spatrick> the analyzer is bad at (reasoning about the contents of the heap). 274*e5dd7070Spatrick 275*e5dd7070SpatrickYeah, i tend to agree on that. For simple typestates, this is probably an 276*e5dd7070Spatrickoverkill, so let's definitely put aside the idea of "ghost symbolic regions" 277*e5dd7070Spatrickthat i had earlier. 278*e5dd7070Spatrick 279*e5dd7070SpatrickBut, to summarize a bit, in our current case, however, the typestate we're 280*e5dd7070Spatricklooking for is the contents of the heap. And when we try to model such 281*e5dd7070Spatricktypestates (complex in this specific manner, i.e. heap-like) in any checker, we 282*e5dd7070Spatrickhave a choice between re-doing this modeling in every such checker (which is 283*e5dd7070Spatricksomething analyzer is indeed good at, but at a price of making checkers heavy) 284*e5dd7070Spatrickor instead relying on the Store to do exactly what it's designed to do. 285*e5dd7070Spatrick 286*e5dd7070Spatrick> I think the key criterion here is: "is the region accessible from outside 287*e5dd7070Spatrick> the library". That is, does the library expose the region as a pointer that 288*e5dd7070Spatrick> can be read to or written from in the client program? If so, then it makes 289*e5dd7070Spatrick> sense for this to be in the store: we are modeling reachable storage as 290*e5dd7070Spatrick> storage. But if we're just modeling arbitrary analysis facts that need to be 291*e5dd7070Spatrick> invalidated when a pointer escapes then we shouldn't try to gin up storage 292*e5dd7070Spatrick> for them just to get invalidation for free. 293*e5dd7070Spatrick 294*e5dd7070SpatrickAs a metaphor, i'd probably compare it to body farms - the difference between 295*e5dd7070Spatrickghost member variables and metadata symbols seems to me like the difference 296*e5dd7070Spatrickbetween body farms and evalCall. Both are nice to have, and body farms are very 297*e5dd7070Spatrickpleasant to work with, even if not omnipotent. I think it's fine for a 298*e5dd7070SpatrickFunctionDecl's body in a body farm to have a local variable, even if such 299*e5dd7070Spatrickvariable doesn't actually exist, even if it cannot be seen from outside the 300*e5dd7070Spatrickfunction call. I'm not seeing immediate practical difference between "it does 301*e5dd7070Spatrickactually exist" and "it doesn't actually exist, just a handy abstraction". 302*e5dd7070SpatrickSimilarly, i think it's fine if we have a ``CXXRecordDecl`` with 303*e5dd7070Spatrickimplementation-defined contents, and try to farm up a member variable as a handy 304*e5dd7070Spatrickabstraction (we don't even need to know its name or offset, only that it's there 305*e5dd7070Spatricksomewhere). 306*e5dd7070Spatrick 307*e5dd7070Spatrick**Artem:** 308*e5dd7070Spatrick 309*e5dd7070SpatrickWe've discussed it in person with Devin, and he provided more points to think 310*e5dd7070Spatrickabout: 311*e5dd7070Spatrick 312*e5dd7070Spatrick* If the initializer list consists of non-POD data, constructors of list's 313*e5dd7070Spatrick objects need to take the sub-region of the list's region as this-region In the 314*e5dd7070Spatrick current (v2) version of this patch, these objects are constructed elsewhere and 315*e5dd7070Spatrick then trivial-copied into the list's metadata pointer region, which may be 316*e5dd7070Spatrick incorrect. This is our overall problem with C++ constructors, which manifests in 317*e5dd7070Spatrick this case as well. Additionally, objects would need to be constructed in the 318*e5dd7070Spatrick analyzer's core, which would not be able to predict that it needs to take a 319*e5dd7070Spatrick checker-specific region as this-region, which makes it harder, though it might 320*e5dd7070Spatrick be mitigated by sharing the checker state traits. 321*e5dd7070Spatrick 322*e5dd7070Spatrick* Because "ghost variables" are not material to the user, we need to somehow 323*e5dd7070Spatrick make super sure that they don't make it into the diagnostic messages. 324*e5dd7070Spatrick 325*e5dd7070SpatrickSo, because this needs further digging into overall C++ support and rises too 326*e5dd7070Spatrickmany questions, i'm delaying a better approach to this problem and will fall 327*e5dd7070Spatrickback to the original trivial patch. 328