Data Formatters
===============

This page is an introduction to the design of the LLDB data formatters
subsystem. The intended audience is people interested in understanding
or modifying the formatters themselves rather than writing a specific data
formatter. For the latter, refer to :doc:`/use/variable/`.

This page also highlights some open areas for improvement to the general
subsystem, and more evolutions not anticipated here are certainly possible.

Overview
--------

The LLDB data formatters subsystem allows the debugger as well as its
end users to customize the way variables look upon inspection in the user
interface (be it the command line tool, or one of the several GUIs that are
backed by LLDB).

To this aim, the formatters are hooked into the ``ValueObject`` model, in order
to provide entry points through which such customization questions can be
answered. For example: What format should this number be printed as? How many
child elements does this ``std::vector`` have?

The architecture of the subsystem is layered, with the highest level layer
being the user visible interaction features (e.g. the ``type ***`` commands,
the SB classes, ...). Other layers of interest that will be analyzed in this
document include:

* Classes implementing individual data formatter types
* Classes implementing formatters navigation, discovery and categorization
* The ``FormatManager`` layer
* The ``DataVisualization`` layer
* The SWIG <> LLDB communication layer

Data Formatter Types
--------------------

As described in the user documentation, there are four types of formatters:

* Formats
* Summaries
* Filters
* Synthetic children

Formatters have descriptor classes, ``Type*Impl``, which contain at least a
"Flags" nested object, which contains both rules to be used by the matching
algorithm (e.g. should the formatter for type Foo apply to a Foo*?)
or rules to be used by the formatter itself (e.g. is this summary a one-liner?).

Individual formatter descriptor classes then also contain data items useful to
them for performing their functionality. For instance ``TypeFormatImpl``
(backing formats) contains an ``lldb::Format`` that is the format to be
applied if this formatter is selected. Upon issuing a ``type format add``,
a new ``TypeFormatImpl`` is created that wraps the user-specified format and
matching options:

::

  entry.reset(new TypeFormatImpl(
      format, TypeFormatImpl::Flags()
                  .SetCascades(m_command_options.m_cascade)
                  .SetSkipPointers(m_command_options.m_skip_pointers)
                  .SetSkipReferences(m_command_options.m_skip_references)));


While formats are fairly simple and only implemented by one class, the other
formatter types are backed by a class hierarchy.

Summaries, for instance, can exist in one of three "flavors":

* Summary strings
* Python script
* Native C++

The base class for summaries, ``TypeSummaryImpl``, is a pure virtual class that
wraps, again, the Flags, and exports among others:

::

  virtual bool FormatObject (ValueObject *valobj, std::string& dest) = 0;


This is the core entry point, which allows subclasses to specify their mode of
operation.

``StringSummaryFormat``, which is the class that implements summary strings,
does a check as to whether the summary is a one-liner, and if not, then uses
its stored summary string to call into ``Debugger::FormatPrompt``, and obtain a
string back, which it returns in ``dest`` as the resulting summary.

For a Python summary, implemented in ``ScriptSummaryFormat``,
``FormatObject()`` calls into the ``ScriptInterpreter`` which is supposed to
hold the knowledge on how to bridge back and forth with the scripting language
(Python in the case of LLDB) in order to produce a valid string.
Implementors
of new ``ScriptInterpreters`` for other languages are expected to provide a
``GetScriptedSummary()`` entry point for this purpose, if they desire to allow
users to provide formatters in the new language.

Lastly, C++ summaries (``CXXFunctionSummaryFormat``) wrap a function pointer
and call into it to execute their duty. It should be noted that there are no
facilities for users to interact with C++ formatters, and as such they are
extremely opaque, effectively being a thin wrapper between plain function
pointers and the LLDB formatters subsystem.

Also, dynamic loading of C++ formatters in LLDB is currently not implemented,
and as such it is safe and reasonable for these formatters to deal with
internal ``ValueObject`` instances instead of public ``SBValue`` objects.

An interesting data point is that summaries are expected to be stateless. While
at the Python layer they are handed an ``SBValue`` (since nothing else could be
visible for scripts), it is not expected that the ``SBValue`` should be cached
and reused - any and all caching occurs on the LLDB side, completely
transparent to the formatter itself.

The design of synthetic children is somewhat more intricate, due to them being
stateful objects. The core idea of the design is that synthetic children act
like a two-tier model, in which there is a backend dataset (the underlying
unformatted ``ValueObject``), and a higher-level view (frontend) which vends
the computed representation.

To implement a new type of synthetic children one would implement a subclass of
``SyntheticChildren``, which, akin to ``TypeFormatImpl``, contains Flags for
matching, and data items to be used for formatting. For instance,
``TypeFilterImpl`` (which implements filters) stores the list of expression
paths of the children to be displayed.

Filters are themselves synthetic children.
Since all they do is provide child
values for a ``ValueObject``, it does not truly matter whether these come from the
real set of children or are crafted through some intricate algorithm. As such,
they perfectly fit within the realm of synthetic children and are only shown as
separate entities for user friendliness (to a user, picking a subset of
elements to be shown with relative ease is a valuable task, and they should not
be concerned with writing scripts to do so).

Once the descriptor of the synthetic children has been coded, in order to hook
it up, one has to implement a subclass of ``SyntheticChildrenFrontEnd``. For a
given type of synthetic children, there is a deep coupling with the matching
front-end class, given that the front-end usually needs data stored in the
descriptor (e.g. a filter needs the list of child elements).

The front-end answers the interesting questions that are the true raison d'être
of synthetic children:

::

  virtual size_t CalculateNumChildren () = 0;
  virtual lldb::ValueObjectSP GetChildAtIndex (size_t idx) = 0;
  virtual size_t GetIndexOfChildWithName (const ConstString &name) = 0;
  virtual bool Update () = 0;
  virtual bool MightHaveChildren () = 0;

Synthetic children providers (their front-ends) will be queried by LLDB for a
number of children, and then for each of them as necessary, they should be
prepared to return a ``ValueObject`` describing the child. They might also be
asked to provide a name-to-index mapping (e.g. to allow LLDB to resolve queries
like ``myFoo.myChild``).

``Update()`` and ``MightHaveChildren()`` are described in the user
documentation, and they mostly serve bookkeeping purposes.

LLDB provides three kinds of synthetic children: filters, scripted synthetics,
and the native C++ providers. Filters are implemented by
``TypeFilterImpl::FrontEnd``.
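
As a rough illustration of the coupling between a descriptor and its front-end,
the following Python sketch models a filter. It is a toy, not LLDB code:
``FilterDescriptor`` and ``FilterFrontEnd`` are hypothetical names, the backend
is a plain dictionary standing in for a ``ValueObject``, and the method names
merely mirror the C++ interface above in snake case.

```python
class FilterDescriptor:
    """Hypothetical descriptor: stores the names of the children to show."""

    def __init__(self, child_names):
        self.child_names = list(child_names)


class FilterFrontEnd:
    """Hypothetical front-end: answers child queries using the descriptor."""

    def __init__(self, descriptor, backend):
        self.descriptor = descriptor
        self.backend = backend  # assumption: a dict of child name -> value

    def calculate_num_children(self):
        return len(self.descriptor.child_names)

    def get_child_at_index(self, idx):
        names = self.descriptor.child_names
        if not 0 <= idx < len(names):
            return None
        return self.backend.get(names[idx])

    def get_index_of_child_with_name(self, name):
        # Name-to-index mapping, as needed to resolve queries by child name.
        try:
            return self.descriptor.child_names.index(name)
        except ValueError:
            return None


# The backend has three members, but the filter only exposes two of them.
backend = {"x": 1, "y": 2, "secret": 3}
front_end = FilterFrontEnd(FilterDescriptor(["y", "x"]), backend)
```

Here the front-end holds no data of its own beyond references to the descriptor
and the backend, which is why the two classes cannot be designed independently.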

Scripted synthetics are implemented by ``ScriptedSyntheticChildren::FrontEnd``,
plus a set of callbacks provided by the ``ScriptInterpreter`` infrastructure to
allow LLDB to pass the front-end queries down to the scripting languages.

As for C++ native synthetics, there is a ``CXXSyntheticChildren``, but no
corresponding ``FrontEnd`` class. The reason for this design is that
``CXXSyntheticChildren`` store a callback to a creator function, which is
responsible for providing a ``FrontEnd``. Each individual formatter (e.g.
``LibstdcppMapIteratorSyntheticFrontEnd``) is a standalone frontend, and once
created retains no relation to its underlying ``SyntheticChildren`` object.

On a ``ValueObject`` level, upon being asked to generate synthetic children for
a ``ValueObject``, LLDB spawns a ``ValueObjectSynthetic`` object which is a
subclass of ``ValueObject``. Building upon the ``ValueObject`` infrastructure,
it stores a backend, and a shared pointer to the ``SyntheticChildren``. Upon
being asked queries about children, it will use the ``SyntheticChildren`` to
generate a front-end for itself and will let the front-end answer questions.
The reason for not storing the ``FrontEnd`` itself is that there is no
guarantee that across updates, the same ``FrontEnd`` will be used over and over
(e.g. a ``SyntheticChildren`` object could serve an entire class hierarchy and
vend different frontends for different subclasses).

Formatters Matching
-------------------

The problem of formatters matching is going from "I have a ``ValueObject``" to
"these are the formatters to be used for it."

There is a rather intricate set of user rules that are involved, and a rather
intricate implementation of this model. All of these relate to the type of the
``ValueObject``. It is assumed that types are a strong enough contract that it
is possible to format an object entirely depending on its type.
If this turns
out to not be correct, then the existing model will have to be changed fairly
deeply.

The basic building block is that formatters can match by exact type name or by
regular expressions, i.e. one can describe matching by saying things like "this
formatter matches type ``__NSDictionaryI``", or "this formatter matches all
type names like ``^std::__1::vector<.+>(( )?&)?$``."

This match happens in class ``FormattersContainer``. For exact matches, this
goes straight to the ``FormatMap`` (the actual storage area for formatters),
whereas for regular expression matches the regular expression is matched
against the provided candidate type name. If one were to introduce a new type
of matching (say, match against the number of ``$`` signs present in the
typename), ``FormattersContainer`` is the place where such a change would have
to be introduced.

It should be noted that this code involves template specialization, and as such
is somewhat trickier than other formatters code to update.

On top of the string matching mechanism (exact or regex), there is a set of
more advanced rules implemented by the ``FormattersContainer``, with the aid of the
``FormattersMatchCandidate``. Namely, it is assumed that any formatter class will
have flags to say whether it allows cascading (i.e. seeing through typedefs),
and whether it allows pointers-to-object and references-to-object to be
formatted. Upon verifying that a formatter would be a textual match, the Flags
are checked, and if they do not allow the formatter to be used (e.g. pointers
are not allowed, and one is looking at a Foo*), then the formatter is rejected
and the search continues. If the flags also match, then the formatter is
returned upstream and the search is over.
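
The two-step check described above (textual match first, Flags second) can be
sketched in Python. This is an illustrative model under stated assumptions, not
the ``FormattersContainer`` implementation: ``Flags``, ``Formatter``,
``Candidate`` and ``find_formatter`` are hypothetical names, and only the
pointer and reference rules are modeled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flags:
    """Hypothetical stand-in for a formatter's matching flags."""
    skip_pointers: bool = False
    skip_references: bool = False


@dataclass(frozen=True)
class Formatter:
    pattern: str      # exact type name, or a regex when is_regex is True
    is_regex: bool
    flags: Flags
    name: str


@dataclass(frozen=True)
class Candidate:
    """A type name plus how it was derived from the value being formatted."""
    type_name: str
    via_pointer: bool = False
    via_reference: bool = False


def textual_match(formatter: Formatter, type_name: str) -> bool:
    if formatter.is_regex:
        return re.search(formatter.pattern, type_name) is not None
    return formatter.pattern == type_name


def find_formatter(formatters, candidate: Candidate) -> Optional[Formatter]:
    for formatter in formatters:
        if not textual_match(formatter, candidate.type_name):
            continue
        # A textual match can still be vetoed by the formatter's flags.
        if candidate.via_pointer and formatter.flags.skip_pointers:
            continue
        if candidate.via_reference and formatter.flags.skip_references:
            continue
        return formatter
    return None


formatters = [
    Formatter("Foo", False, Flags(skip_pointers=True), "foo-no-pointers"),
    Formatter(r"^Fo+$", True, Flags(), "foo-regex"),
]
```

With this setup, a plain ``Foo`` candidate picks the first formatter, while a
candidate reached through a pointer is rejected by it and falls through to the
regex formatter, which mirrors the "rejected and the search continues" rule.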

One relevant fact to notice is that this entire mechanism is not dependent on
the kind of formatter to be returned, which makes it easier to devise new types
of formatters as the lowest layers of the system. The demands on individual
formatters are that they define a few typedefs, and export a Flags object, and
then they can be freely matched against types as needed.

This mechanism is replicated across a number of categories. A category is a
named bucket where formatters are grouped on some basis. The most common reason
for a category to exist is a library (e.g. ``libcxx`` formatters vs. ``libstdcpp``
formatters). Categories can be enabled or disabled, and they have a priority
number, called position. The priority sets a strong order among enabled
categories. A category named "default" is always the highest priority one and
it's the category where all formatters that do not ask for a category of their
own end up (e.g. ``type summary add ....`` without a ``-w somecategory`` flag
passed). The algorithm queries each category, in the order of their priorities,
for a formatter for a type, and upon receiving a positive answer from a
category, ends the search. Of course, no search occurs in disabled categories.

At the individual category level, there is the first dependence on the type of
formatter to be returned. Since both filters and synthetic children proper are
implemented through the same backing store, the matching code needs to ensure
that, were both a synthetic children provider and a filter to match a type,
only the most recently added one is actually used. The details of the algorithm
used are to be found in ``TypeCategoryImpl::Get()``.

It is quite obvious, even to a casual reader, that there are a number of
complexities involved in this algorithm. For starters, the entire search
process has to be repeated for every variable.
Moreover, for each category, one
has to repeat the entire process of crawling the types (go to pointee, ...).
This is exactly the algorithm initially implemented by LLDB. Over the course of
the life of the formatters subsystem, two main evolutions have been made to the
matching mechanism:

* A caching mechanism
* A pregeneration of all possible type matches

The cache is a layer that sits between the ``FormatManager`` and the
``TypeCategoryMap``. Upon being asked to figure out a formatter, the ``FormatManager``
will first query the cache layer, and only if that fails, will the categories
be queried using the full search algorithm. The result of that full search will
then be stored in the cache. Even a negative answer (no formatter) gets stored.
The negative answer is actually the most beneficial to cache as obtaining it
requires traversing all possible formatters in all categories just to get a
no-op back.

Of course, once an answer is cached, getting it will be much quicker than going
to a full category search, as the cached answers are of the form "type foo" -->
"formatter bar". But given how formatters can be edited or removed by the user,
either at the command line or via the API, there needs to be a way to
invalidate the cache.

This happens through the ``FormatManager::Changed()`` method. In general, anything
that changes the formatters causes ``FormatManager::Changed()`` to be called
through the ``IFormatChangeListener`` interface. This call increases the
``FormatManager``'s revision and clears the cache. The revision number is a
monotonically increasing integer counter that essentially corresponds to the
number of changes made to the formatters throughout the current LLDB session.
This counter is used by ``ValueObjects`` to know when their formatters are out of
date.
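
The interplay between the cache, negative answers and the revision counter can
be modeled in a few lines of Python. This is a sketch under assumptions, not
the ``FormatCache`` code: ``FormatCacheSketch`` and its members are hypothetical
names, and a dictionary lookup stands in for the full category search.

```python
class FormatCacheSketch:
    """Toy cache keyed by type name; stores misses as well as hits."""

    _MISSING = object()  # sentinel: distinguishes "not cached" from "cached None"

    def __init__(self, full_search):
        self.full_search = full_search  # stands in for the category search
        self.entries = {}
        self.revision = 0
        self.searches = 0  # counts how often the expensive path is taken

    def get(self, type_name):
        cached = self.entries.get(type_name, self._MISSING)
        if cached is not self._MISSING:
            return cached  # may be None: a cached negative answer
        self.searches += 1
        result = self.full_search(type_name)
        self.entries[type_name] = result
        return result

    def changed(self):
        """Any formatter edit bumps the revision and drops every entry."""
        self.revision += 1
        self.entries.clear()


# A dict plays the role of "all formatters in all categories".
known = {"Foo": "foo-summary"}
cache = FormatCacheSketch(known.get)
```
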
Since a search is a potentially expensive operation, before caching was
introduced, individual ``ValueObjects`` remembered which revision of the
``FormatManager`` they used to search for their formatter, and stored it, so that
they would not repeat the search unless a change in the formatters had
occurred. While caching has made this less critical of an optimization, it is
still sensible and thus is kept.

Lastly, as a side note, it is worth highlighting that any change in the
formatters invalidates the entire cache. It would likely not be impossible to
be smarter and figure out a subset of cache entries to be deleted, letting
others persist, instead of having to rebuild the entire cache from scratch.
However, given that formatters are not that frequently changed during a debug
session, and the algorithmic complexity to "get it right" seems larger than the
potential benefit to be had from doing it, the full cache invalidation is the
chosen policy. The algorithm to selectively invalidate entries is probably one
of the major areas for improvements in formatters performance.

The second major optimization, introduced fairly recently, is the pregeneration
of type matches. The original algorithm was based upon the notion of a
``FormatNavigator`` as a smart object, aware of all the intricacies of the
matching rules. For each category, the ``FormatNavigator`` would generate the
possible matches (e.g. dynamic type, pointee type, ...), and check each one,
one at a time. If that failed for a category, the next one would again generate
the same matches.

This worked well, but was of course inefficient. The
``FormattersMatchCandidate`` is the solution to this performance issue.
In top-of-tree LLDB, the ``FormatManager`` has the centralized notion of the
matching rules, and the former ``FormatNavigators`` are now
``FormattersContainers``, whose only job is to guarantee a centralized storage
of formatters, and thread-safe access to such storage.

``FormatManager::GetPossibleMatches()`` fills a vector of possible matches. The
way it works is by applying each rule, generating the corresponding typename,
and storing the typename, plus the required Flags for that rule to be accepted
as a match candidate (e.g. if the match comes by fetching the pointee type, a
formatter that matches will have to allow pointees as part of its Flags
object). The ``TypeCategoryMap``, when tasked with finding a formatter for a
type, generates all possible matches and passes them down to each category. In
this model, the type system only does its (expensive) job once, and textual or
regex matches are the core of the work.

FormatManager and DataVisualization
-----------------------------------

There are two main entry points in the data formatters: the ``FormatManager`` and
the ``DataVisualization``.

The ``FormatManager`` is the internal such entry point. In this context,
internal refers to data formatters code itself, compared to other parts of
LLDB. For other components of the debugger, the ``DataVisualization`` provides
a more stable entry point. On the other hand, the ``FormatManager`` is an
aggregator of all moving parts, and as such is less stable in the face of
refactoring.

People involved in the data formatters code itself, however, will most likely
have to confront the ``FormatManager`` for significant architecture changes.

The ``FormatManager`` wraps a ``TypeCategoryMap`` (the list of all existing
categories, enabled and not), the ``FormatCache``, and several utility objects.
Plus, it is the repository of named summaries, since these don't logically
belong anywhere else.

It is also responsible for creating all builtin formatters upon the launch of
LLDB. It does so through a set of ``Load***Formatters()`` methods, invoked as
part of its constructor. The original design of data formatters anticipated
that individual libraries would load their formatters as part of their debug
information. This work however has largely been left unattended in practice,
and as such core system libraries (mostly those for macOS/iOS development as of
today) load their formatters in a hardcoded fashion.

For performance reasons, the ``FormatManager`` is constructed upon being first
required. This happens through the ``DataVisualization`` layer. Upon first
being asked for anything formatter-related, ``DataVisualization`` calls its own
local static function ``GetFormatManager()``, which in turn constructs and
returns a local static ``FormatManager``.

Unlike most things in LLDB, the lifetime of the ``FormatManager`` is the same
as the entire session, rather than a specific ``Debugger`` or ``Target``
instance. This is an area to be improved, but as of now it has not caused
enough grief to warrant action. If this work were to be undertaken, one could
conceivably devise a per-architecture-triple model, upon the assumption that an
OS and CPU combination are a good enough key to decide which formatters apply
(e.g. Linux i386 is probably different from macOS x86_64, but two macOS x86_64
targets will probably have the same formatters; of course versioning of the
underlying OS is also to be considered, but experience with OSX has shown that
formatters can take care of that internally in most cases of interest).

The public entry point is the ``DataVisualization`` layer.
``DataVisualization`` is a static class on which questions can be asked in a
relatively refactoring-safe manner.

The main question asked of it is to obtain formatters for ``ValueObjects`` (or
typenames). One can also query ``DataVisualization`` for named summaries or
individual categories, but of course those queries delve deeper in the internal
object model.

As said, the ``FormatManager`` holds a notion of revision number, which changes
every time formatters are edited (added, deleted, categories enabled or
disabled, ...). Through ``DataVisualization::ForceUpdate()`` one can cause the
same effects of a formatters edit to happen without it actually having
happened.

The main reason for this feature is that formatters can be dynamically created
in Python, and one can then enter the ``ScriptInterpreter`` and edit the
formatter function or class. If formatters were not updated, one could find
them to be out of sync with the new definitions of these objects. To avoid the
issue, whenever the user exits the scripting mode, formatters force an update
to make sure new potential definitions are reloaded on demand.

The SWIG Layer
--------------

In order to implement formatters written in Python, LLDB requires that
``ScriptInterpreter`` implementations provide a set of functions that one can call
to ask formatting questions of scripts.

For instance, in order to obtain a scripting summary, LLDB calls:

::

  virtual bool
  GetScriptedSummary(const char *function_name, lldb::ValueObjectSP valobj,
                     lldb::ScriptInterpreterObjectSP &callee_wrapper_sp,
                     std::string &retval)


For Python, this function is implemented by first checking if the
``callee_wrapper_sp`` is valid. If so, LLDB knows that it does not need to
search for a function with the passed name, and can directly call the wrapped
Python function object.
Either way, the call is routed to a global callback
``g_swig_typescript_callback``.

This callback pointer points to ``LLDBSwigPythonCallTypeScript``. The details
of the implementation require familiarity with the Python C API, plus a few
utility objects defined by LLDB to ease the burden of dealing with the
scripting world. However, as a sketch of what happens, the code tries to find a
Python function object with the given name (i.e. if you say ``type summary add
-F module.function`` LLDB will scan for the ``module`` module, and then for a
function named ``function`` inside the module's namespace). If the function
object is found, it is wrapped in a ``PyCallable``, which is an LLDB utility class
that wraps the callable and allows for easier calling. The callable gets
invoked, and the return value, if any, is cast into a string. Originally, if a
non-string object was returned, LLDB would refuse to use it. This disallowed
such a simple construct as:

::

  def getSummary(value, *args):
      return 1

Similar considerations apply to other formatter (and non-formatter related)
scripting callbacks.
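
The lookup-call-convert sequence described above can be approximated in pure
Python. This is only a sketch: the real code goes through the Python C API and
LLDB's own utility classes, whereas here ``call_summary`` is a hypothetical
helper that resolves the module with ``importlib`` and passes a single argument
to the callable to stay runnable outside of LLDB.

```python
import importlib


def call_summary(qualified_name, value):
    """Resolve 'module.function', call it, and convert the result to a string."""
    module_name, _, function_name = qualified_name.rpartition(".")
    if not module_name:
        return None  # no module part to search in
    module = importlib.import_module(module_name)
    function = getattr(module, function_name, None)
    if not callable(function):
        return None  # nothing callable under that name
    result = function(value)
    if result is None:
        return None  # no return value, hence no summary
    # Non-string results are converted rather than rejected.
    return str(result)
```

For example, ``call_summary("math.sqrt", 16.0)`` yields the string ``"4.0"``
even though ``math.sqrt`` returns a float, which is the behavior that makes a
summary function returning ``1`` acceptable.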