============================
"Clang" CFE Internals Manual
============================

.. contents::
   :local:

Introduction
============

This document describes some of the more important APIs and internal design
decisions made in the Clang C front-end. The purpose of this document is to
both capture some of this high level information and also describe some of the
design decisions behind it. This is meant for people interested in hacking on
Clang, not for end-users. The description below is categorized by libraries,
and does not describe any of the clients of the libraries.

LLVM Support Library
====================

The LLVM ``libSupport`` library provides many underlying libraries and
`data-structures <https://llvm.org/docs/ProgrammersManual.html>`_, including
command line option processing, various containers and a system abstraction
layer, which is used for file system access.

The Clang "Basic" Library
=========================

This library certainly needs a better name. The "basic" library contains a
number of low-level utilities for tracking and manipulating source buffers,
locations within the source buffers, diagnostics, tokens, target abstraction,
and information about the subset of the language being compiled for.

Part of this infrastructure is specific to C (such as the ``TargetInfo``
class); other parts could be reused for other non-C-based languages
(``SourceLocation``, ``SourceManager``, ``Diagnostics``, ``FileManager``).
When and if there is future demand we can figure out if it makes sense to
introduce a new library, move the general classes somewhere else, or introduce
some other solution.

We describe the roles of these classes in order of their dependencies.

The Diagnostics Subsystem
-------------------------

The Clang Diagnostics subsystem is an important part of how the compiler
communicates with the human.
Diagnostics are the warnings and errors produced
when the code is incorrect or dubious. In Clang, each diagnostic produced has
(at the minimum) a unique ID, an English translation associated with it, a
:ref:`SourceLocation <SourceLocation>` to "put the caret", and a severity
(e.g., ``WARNING`` or ``ERROR``). They can also optionally include a number of
arguments to the diagnostic (which fill in "%0"'s in the string) as well as a
number of source ranges that relate to the diagnostic.

In this section, we'll be giving examples produced by the Clang command line
driver, but diagnostics can be :ref:`rendered in many different ways
<DiagnosticConsumer>` depending on how the ``DiagnosticConsumer`` interface is
implemented. A representative example of a diagnostic is:

.. code-block:: text

  t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
  P = (P-42) + Gamma*4;
      ~~~~~~ ^ ~~~~~~~

In this example, you can see the English translation, the severity (error),
the source location (the caret ("``^``") and file/line/column info), the
source ranges "``~~~~``", and the arguments to the diagnostic ("``int *``" and
"``_Complex float``"). You'll have to believe me that there is a unique ID
backing the diagnostic :).

Getting all of this to happen has several steps and involves many moving
pieces; this section describes them and talks about best practices when adding
a new diagnostic.

The ``Diagnostic*Kinds.td`` files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Diagnostics are created by adding an entry to one of the
``clang/Basic/Diagnostic*Kinds.td`` files, depending on what library will be
using it. From this file, :program:`tblgen` generates the unique ID of the
diagnostic, the severity of the diagnostic and the English translation + format
string.

There is little sanity with the naming of the unique ID's right now.
Some start with ``err_``, ``warn_``, ``ext_`` to encode the severity into the
name. Since the enum is referenced in the C++ code that produces the
diagnostic, it is somewhat useful for it to be reasonably short.

The severity of the diagnostic comes from the set {``NOTE``, ``REMARK``,
``WARNING``, ``EXTENSION``, ``EXTWARN``, ``ERROR``}. The ``ERROR`` severity is
used for diagnostics indicating the program is never acceptable under any
circumstances. When an error is emitted, the AST for the input code may not be
fully built. The ``EXTENSION`` and ``EXTWARN`` severities are used for
extensions to the language that Clang accepts. This means that Clang fully
understands and can represent them in the AST, but we produce diagnostics to
tell the user their code is non-portable. The difference is that the former
are ignored by default, and the latter warn by default. The ``WARNING``
severity is used for constructs that are valid in the currently selected source
language but that are dubious in some way. The ``REMARK`` severity provides
generic information about the compilation that is not necessarily related to
any dubious code. The ``NOTE`` level is used to staple more information onto
previous diagnostics.

These *severities* are mapped into a smaller set (the ``Diagnostic::Level``
enum, {``Ignored``, ``Note``, ``Remark``, ``Warning``, ``Error``, ``Fatal``}) of
output *levels* by the diagnostics subsystem based on various configuration
options. Clang internally supports a fully fine grained mapping mechanism that
allows you to map almost any diagnostic to the output level that you want. The
only diagnostics that cannot be mapped are ``NOTE``\ s, which always follow the
severity of the previously emitted diagnostic, and ``ERROR``\ s, which can only
be mapped to ``Fatal`` (it is not possible to turn an error into a warning, for
example).

Diagnostic mappings are used in many ways.
For example, if the user specifies
``-pedantic``, ``EXTENSION`` maps to ``Warning``; if they specify
``-pedantic-errors``, it turns into ``Error``. The same mechanism is used to
implement options like ``-Wunused-macros``, ``-Wundef`` etc.

Mapping to ``Fatal`` should only be used for diagnostics that are considered so
severe that error recovery won't be able to recover sensibly from them (thus
spewing a ton of bogus errors). One example of this class of error is failure
to ``#include`` a file.

Diagnostic Wording
^^^^^^^^^^^^^^^^^^

The wording used for a diagnostic is critical because it is the only way for a
user to know how to correct their code. Use the following suggestions when
wording a diagnostic.

* Diagnostics in Clang do not start with a capital letter and do not end with
  punctuation.

  * This does not apply to proper nouns like ``Clang`` or ``OpenMP``, to
    acronyms like ``GCC`` or ``ARC``, or to language standards like ``C23``
    or ``C++17``.
  * A trailing question mark is allowed, e.g., ``unknown identifier %0; did
    you mean %1?``.

* Appropriately capitalize proper nouns like ``Clang``, ``OpenCL``, ``GCC``,
  ``Objective-C``, etc., and language standard versions like ``C11`` or
  ``C++11``.
* The wording should be succinct. If necessary, use a semicolon to combine
  sentence fragments instead of using complete sentences. e.g., prefer wording
  like ``'%0' is deprecated; it will be removed in a future release of Clang``
  over wording like ``'%0' is deprecated. It will be removed in a future release
  of Clang``.
* The wording should be actionable and avoid using standards terms or grammar
  productions that a new user would not be familiar with. e.g., prefer wording
  like ``missing semicolon`` over wording like ``syntax error`` (which is not
  actionable) or ``expected unqualified-id`` (which uses standards terminology).
* The wording should clearly explain what is wrong with the code rather than
  restating what the code does. e.g., prefer wording like ``type %0 requires a
  value in the range %1 to %2`` over wording like ``%0 is invalid``.
* The wording should have enough contextual information to help the user
  identify the issue in a complex expression. e.g., prefer wording like
  ``both sides of the %0 binary operator are identical`` over wording like
  ``identical operands to binary operator``.
* Use single quotes to denote syntactic constructs or command line arguments
  named in a diagnostic message. e.g., prefer wording like ``'this' pointer
  cannot be null in well-defined C++ code`` over wording like ``this pointer
  cannot be null in well-defined C++ code``.
* Prefer diagnostic wording without contractions whenever possible. The single
  quote in a contraction can be visually distracting due to its use with
  syntactic constructs and contractions can be harder to understand for
  non-native English speakers.

The Format String
^^^^^^^^^^^^^^^^^

The format string for the diagnostic is very simple, but it has some power. It
takes the form of a string in English with markers that indicate where and how
arguments to the diagnostic are inserted and formatted. For example, here are
some simple format strings:

.. code-block:: c++

  "binary integer literals are an extension"
  "format string contains '\\0' within the string body"
  "more '%%' conversions than data arguments"
  "invalid operands to binary expression (%0 and %1)"
  "overloaded '%0' must be a %select{unary|binary|unary or binary}2 operator"
       " (has %1 parameter%s1)"

These examples show some important points of format strings.
You can use any
plain ASCII character in the diagnostic string except "``%``" without a
problem, but these are C strings, so you have to use and be aware of all the C
escape sequences (as in the second example). If you want to produce a "``%``"
in the output, use the "``%%``" escape sequence, like the third diagnostic.
Finally, Clang uses the "``%...[digit]``" sequences to specify where and how
arguments to the diagnostic are formatted.

Arguments to the diagnostic are numbered according to how they are specified by
the C++ code that :ref:`produces them <internals-producing-diag>`, and are
referenced by ``%0`` .. ``%9``. If you have more than 10 arguments to your
diagnostic, you are doing something wrong :). Unlike ``printf``, there is no
requirement that arguments to the diagnostic end up in the output in the same
order as they are specified; you could have a format string with "``%1 %0``"
that swaps them, for example. The text between the percent sign and the digit
specifies formatting instructions. If there are no instructions, the argument
is just turned into a string and substituted in.

Here are some "best practices" for writing the English format string:

* Keep the string short. It should ideally fit in the 80 column limit of the
  ``DiagnosticKinds.td`` file. This avoids the diagnostic wrapping when
  printed, and forces you to think about the important point you are conveying
  with the diagnostic.
* Take advantage of location information. The user will be able to see the
  line and location of the caret, so you don't need to tell them that the
  problem is with the 4th argument to the function: just point to it.
* Do not capitalize the diagnostic string, and do not end it with a period.
* If you need to quote something in the diagnostic string, use single quotes.
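
The substitution rules described above can be sketched with a toy expander.
This is a self-contained illustration under stated assumptions, not Clang's
implementation: ``formatToyDiagnostic`` is a hypothetical helper that handles
only the ``%%`` escape and plain ``%0`` .. ``%9`` markers, and it assumes every
argument has already been rendered to a string.

.. code-block:: c++

  #include <cassert>
  #include <cctype>
  #include <cstddef>
  #include <string>
  #include <vector>

  // Toy model only: expands "%%" to '%' and "%N" (N = 0..9) to Args[N].
  // Format modifiers such as %select or %s are deliberately not handled.
  std::string formatToyDiagnostic(const std::string &Format,
                                  const std::vector<std::string> &Args) {
    std::string Out;
    for (std::size_t I = 0; I < Format.size(); ++I) {
      char C = Format[I];
      // Ordinary characters (and a lone trailing '%') are copied verbatim.
      if (C != '%' || I + 1 == Format.size()) {
        Out += C;
        continue;
      }
      char Next = Format[++I];
      if (Next == '%') {
        Out += '%'; // the "%%" escape produces a single percent sign
      } else if (std::isdigit(static_cast<unsigned char>(Next))) {
        std::size_t Index = static_cast<std::size_t>(Next - '0');
        assert(Index < Args.size() && "format string names a missing argument");
        Out += Args[Index]; // arguments may appear in any order
      } else {
        // Anything else is outside the scope of this sketch; keep it as-is.
        Out += '%';
        Out += Next;
      }
    }
    return Out;
  }

With this sketch, ``formatToyDiagnostic("%1 %0", {"a", "b"})`` yields ``b a``,
which illustrates that argument order in the output is decided by the format
string rather than by the call site.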

Diagnostics should never take random English strings as arguments: you
shouldn't use "``you have a problem with %0``" and pass in things like "``your
argument``" or "``your return value``" as arguments. Doing this prevents
:ref:`translating <internals-diag-translation>` the Clang diagnostics to other
languages (because they'll get random English words in their otherwise
localized diagnostic). The exceptions to this are C/C++ language keywords
(e.g., ``auto``, ``const``, ``mutable``, etc) and C/C++ operators (``/=``).
Note that things like "pointer" and "reference" are not keywords. On the other
hand, you *can* include anything that comes from the user's source code,
including variable names, types, labels, etc. The "``select``" format can be
used to achieve this sort of thing in a localizable way, see below.

Formatting a Diagnostic Argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Arguments to diagnostics are fully typed internally, and come from a couple
different classes: integers, types, names, and random strings. Depending on
the class of the argument, it can be optionally formatted in different ways.
This gives the ``DiagnosticConsumer`` information about what the argument means
without requiring it to use a specific presentation (consider this MVC for
Clang :).

It is really easy to add format specifiers to the Clang diagnostics system, but
they should be discussed before they are added. If you are creating a lot of
repetitive diagnostics and/or have an idea for a useful formatter, please bring
it up on the cfe-dev mailing list.

Here are the different diagnostic argument formats currently supported by
Clang:

**"s" format**

Example:
  ``"requires %0 parameter%s0"``
Class:
  Integers
Description:
  This is a simple formatter for integers that is useful when producing English
  diagnostics. When the integer is 1, it prints as nothing. When the integer
  is not 1, it prints as "``s``". This allows some simple grammatical forms to
  be handled correctly, and eliminates the need to use gross things like
  ``"requires %1 parameter(s)"``. Note, this only handles adding a simple
  "``s``" character, it will not handle situations where pluralization is more
  complicated such as turning ``fancy`` into ``fancies`` or ``mouse`` into
  ``mice``. You can use the "plural" format specifier to handle such situations.

**"select" format**

Example:
  ``"must be a %select{unary|binary|unary or binary}0 operator"``
Class:
  Integers
Description:
  This format specifier is used to merge multiple related diagnostics together
  into one common one, without requiring the difference to be specified as an
  English string argument. Instead of specifying the string, the diagnostic
  gets an integer argument and the format string selects the numbered option.
  In this case, the "``%0``" value must be an integer in the range [0..2]. If
  it is 0, it prints "unary"; if it is 1, it prints "binary"; if it is 2, it
  prints "unary or binary". This allows other language translations to
  substitute reasonable words (or entire phrases) based on the semantics of the
  diagnostic instead of having to do things textually. The selected string
  does undergo formatting.

**"enum_select" format**

Example:
  ``unknown frobbling of a %enum_select<FrobbleKind>{%VarDecl{variable declaration}|%FuncDecl{function declaration}}0 when blarging``
Class:
  Integers
Description:
  This format specifier is used exactly like a ``select`` specifier, except it
  additionally generates a namespace, enumeration, and enumerator list based on
  the format string given. In the above case, a namespace is generated named
  ``FrobbleKind`` that has an unscoped enumeration with the enumerators
  ``VarDecl`` and ``FuncDecl`` which correspond to the values 0 and 1. This
  permits a clearer use of the ``Diag`` in source code, as the above could be
  called as: ``Diag(Loc, diag::frobble) << diag::FrobbleKind::VarDecl``.

**"plural" format**

Example:
  ``"you have %0 %plural{1:mouse|:mice}0 connected to your computer"``
Class:
  Integers
Description:
  This is a formatter for complex plural forms. It is designed to handle even
  the requirements of languages with very complex plural forms, as many Baltic
  languages have. The argument consists of a series of ``expression:form``
  pairs separated by "``|``", where the first form whose expression evaluates
  to true is the result of the modifier.

  An expression can be empty, in which case it is always true. See the example
  at the top. Otherwise, it is a series of one or more numeric conditions,
  separated by ",". If any condition matches, the expression matches. Each
  numeric condition can take one of three forms.

  * number: A simple decimal number matches if the argument is the same as the
    number. Example: ``"%plural{1:mouse|:mice}0"``
  * range: A range in square brackets matches if the argument is within the
    range. The range is inclusive on both ends. Example:
    ``"%plural{0:none|1:one|[2,5]:some|:many}0"``
  * modulo: A modulo operator is followed by a number, an equals sign and
    either a number or a range. The tests are the same as for plain numbers
    and ranges, but the argument is taken modulo the number first. Example:
    ``"%plural{%100=0:even hundred|%100=[1,50]:lower half|:everything else}1"``

  The parser is very unforgiving. A syntax error, even whitespace, will abort,
  as will a failure to match the argument against any expression.
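
The matching rules of the "plural" format can be modeled with a small
evaluator. The sketch below is an illustration only and does not reproduce
Clang's parser: the ``toy`` namespace and the ``selectPluralForm`` helper are
hypothetical names, the input is assumed to be the text between the braces of
``%plural{...}``, and forms are assumed not to contain "``|``" or "``:``".

.. code-block:: c++

  #include <cassert>
  #include <cctype>
  #include <cstddef>
  #include <string>

  namespace toy {

  // Reads a decimal number starting at Pos and advances Pos past it.
  unsigned parseNumber(const std::string &S, std::size_t &Pos) {
    assert(Pos < S.size() && std::isdigit(static_cast<unsigned char>(S[Pos])) &&
           "expected a decimal number");
    unsigned N = 0;
    while (Pos < S.size() && std::isdigit(static_cast<unsigned char>(S[Pos])))
      N = N * 10 + static_cast<unsigned>(S[Pos++] - '0');
    return N;
  }

  // Evaluates one condition: "N", "[Lo,Hi]", or "%M=" followed by either.
  bool matchCondition(const std::string &S, std::size_t &Pos, unsigned Value) {
    if (Pos < S.size() && S[Pos] == '%') {
      ++Pos;
      unsigned Mod = parseNumber(S, Pos);
      assert(Mod != 0 && Pos < S.size() && S[Pos] == '=' && "malformed modulo");
      ++Pos;
      Value %= Mod; // the remaining test sees the argument modulo Mod
    }
    if (Pos < S.size() && S[Pos] == '[') {
      ++Pos;
      unsigned Lo = parseNumber(S, Pos);
      assert(Pos < S.size() && S[Pos] == ',' && "malformed range");
      ++Pos;
      unsigned Hi = parseNumber(S, Pos);
      assert(Pos < S.size() && S[Pos] == ']' && "malformed range");
      ++Pos;
      return Lo <= Value && Value <= Hi; // inclusive on both ends
    }
    return parseNumber(S, Pos) == Value;
  }

  // Returns the first form whose expression matches Value.
  std::string selectPluralForm(const std::string &Spec, unsigned Value) {
    std::size_t Pos = 0;
    while (Pos <= Spec.size()) {
      std::size_t End = Spec.find('|', Pos);
      if (End == std::string::npos)
        End = Spec.size();
      std::string Pair = Spec.substr(Pos, End - Pos);
      std::size_t Colon = Pair.find(':');
      assert(Colon != std::string::npos && "each pair needs a ':'");
      std::string Expr = Pair.substr(0, Colon);

      bool Matches = Expr.empty(); // an empty expression is always true
      std::size_t P = 0;
      while (P < Expr.size()) {
        if (matchCondition(Expr, P, Value))
          Matches = true; // any matching condition is enough
        if (P < Expr.size()) {
          assert(Expr[P] == ',' && "conditions are separated by ','");
          ++P;
        }
      }
      if (Matches)
        return Pair.substr(Colon + 1);
      Pos = End + 1;
    }
    assert(false && "argument matched no expression");
    return std::string();
  }

  } // namespace toy

With this sketch, ``toy::selectPluralForm("0:none|1:one|[2,5]:some|:many", 4)``
returns ``some``, and an argument of 9 falls through to the empty expression
and returns ``many``.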

**"ordinal" format**

Example:
  ``"ambiguity in %ordinal0 argument"``
Class:
  Integers
Description:
  This is a formatter which represents the argument number as an ordinal: the
  value ``1`` becomes ``1st``, ``3`` becomes ``3rd``, and so on. Values less
  than ``1`` are not supported. This formatter is currently hard-coded to use
  English ordinals.

**"human" format**

Example:
  ``"total size is %human0 bytes"``
Class:
  Integers
Description:
  This is a formatter which represents the argument number in a human readable
  format: the value ``123`` stays ``123``, ``12345`` becomes ``12.34k``,
  ``6666666`` becomes ``6.67M``, and so on for 'G' and 'T'.

**"objcclass" format**

Example:
  ``"method %objcclass0 not found"``
Class:
  ``DeclarationName``
Description:
  This is a simple formatter that indicates the ``DeclarationName`` corresponds
  to an Objective-C class method selector. As such, it prints the selector
  with a leading "``+``".

**"objcinstance" format**

Example:
  ``"method %objcinstance0 not found"``
Class:
  ``DeclarationName``
Description:
  This is a simple formatter that indicates the ``DeclarationName`` corresponds
  to an Objective-C instance method selector. As such, it prints the selector
  with a leading "``-``".

**"q" format**

Example:
  ``"candidate found by name lookup is %q0"``
Class:
  ``NamedDecl *``
Description:
  This formatter indicates that the fully-qualified name of the declaration
  should be printed, e.g., "``std::vector``" rather than "``vector``".

**"diff" format**

Example:
  ``"no known conversion %diff{from $ to $|from argument type to parameter type}1,2"``
Class:
  ``QualType``
Description:
  This formatter takes two ``QualType``\ s and attempts to print a template
  difference between the two. If tree printing is off, the text inside the
  braces before the pipe is printed, with the formatted text replacing the $.
  If tree printing is on, the text after the pipe is printed and a type tree is
  printed after the diagnostic message.

**"sub" format**

Example:
  Given the following record definition of type ``TextSubstitution``:

  .. code-block:: text

    def select_ovl_candidate : TextSubstitution<
      "%select{function|constructor}0%select{| template| %2}1">;

  which can be used as

  .. code-block:: text

    def note_ovl_candidate : Note<
      "candidate %sub{select_ovl_candidate}3,2,1 not viable">;

  and will act as if it was written
  ``"candidate %select{function|constructor}3%select{| template| %1}2 not viable"``.
Description:
  This format specifier is used to avoid repeating strings verbatim in multiple
  diagnostics. The argument to ``%sub`` must name a ``TextSubstitution`` tblgen
  record. The substitution must specify all arguments used by the substitution,
  and the modifier indexes in the substitution are re-numbered accordingly. The
  substituted text must itself be a valid format string before substitution.

.. _internals-producing-diag:

Producing the Diagnostic
^^^^^^^^^^^^^^^^^^^^^^^^

Now that you've created the diagnostic in the ``Diagnostic*Kinds.td`` file, you
need to write the code that detects the condition in question and emits the new
diagnostic. Various components of Clang (e.g., the preprocessor, ``Sema``,
etc.) provide a helper function named "``Diag``". It creates a diagnostic and
accepts the arguments, ranges, and other information that goes along with it.

For example, the binary expression error comes from code like this:

.. code-block:: c++

  if (various things that are bad)
    Diag(Loc, diag::err_typecheck_invalid_operands)
      << lex->getType() << rex->getType()
      << lex->getSourceRange() << rex->getSourceRange();

This shows the use of the ``Diag`` method: it takes a location (a
:ref:`SourceLocation <SourceLocation>` object) and a diagnostic enum value
(which matches the name from ``Diagnostic*Kinds.td``). If the diagnostic takes
arguments, they are specified with the ``<<`` operator: the first argument
becomes ``%0``, the second becomes ``%1``, etc. The diagnostic interface
allows you to specify arguments of many different types, including ``int`` and
``unsigned`` for integer arguments, ``const char*`` and ``std::string`` for
string arguments, ``DeclarationName`` and ``const IdentifierInfo *`` for names,
``QualType`` for types, etc. ``SourceRange``\ s are also specified with the
``<<`` operator, but do not have a specific ordering requirement.

As you can see, adding and producing a diagnostic is pretty straightforward.
The hard part is deciding exactly what you need to say to help the user,
picking a suitable wording, and providing the information needed to format it
correctly. The good news is that the call site that issues a diagnostic should
be completely independent of how the diagnostic is formatted and in what
language it is rendered.

Fix-It Hints
^^^^^^^^^^^^

In some cases, the front end emits diagnostics when it is clear that some small
change to the source code would fix the problem. Examples include a missing
semicolon at the end of a statement or a use of deprecated syntax that is
easily rewritten into a more modern form. Clang tries very hard to emit the
diagnostic and recover gracefully in these and other cases.

However, for these cases where the fix is obvious, the diagnostic can be
annotated with a hint (referred to as a "fix-it hint") that describes how to
change the code referenced by the diagnostic to fix the problem. For example,
it might add the missing semicolon at the end of the statement or rewrite the
use of a deprecated construct into something more palatable. Here is one such
example from the C++ front end, where we warn about the right-shift operator
changing meaning from C++98 to C++11:

.. code-block:: text

  test.cpp:3:7: warning: use of right-shift operator ('>>') in template argument
    will require parentheses in C++11
  A<100 >> 2> *a;
        ^
    (       )

Here, the fix-it hint is suggesting that parentheses be added, and showing
exactly where those parentheses would be inserted into the source code. The
fix-it hints themselves describe what changes to make to the source code in an
abstract manner, which the text diagnostic printer renders as a line of
"insertions" below the caret line. :ref:`Other diagnostic clients
<DiagnosticConsumer>` might choose to render the code differently (e.g., as
markup inline) or even give the user the ability to automatically fix the
problem.

Fix-it hints on errors and warnings need to obey these rules:

* Since they are automatically applied if ``-Xclang -fixit`` is passed to the
  driver, they should only be used when it's very likely they match the user's
  intent.
* Clang must recover from errors as if the fix-it had been applied.
* Fix-it hints on a warning must not change the meaning of the code.
  However, a hint may clarify the meaning as intentional, for example by adding
  parentheses when the precedence of operators isn't obvious.

If a fix-it can't obey these rules, put the fix-it on a note. Fix-its on notes
are not applied automatically.
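
To make the parenthesization example concrete, the sketch below models the two
suggested insertions as edits on a single line of text. It is a toy model and
does not use Clang's ``FixItHint`` class: ``ToyInsertion`` and
``applyToyInsertions`` are hypothetical names, and positions are assumed to be
plain byte offsets into the original line rather than ``SourceLocation``\ s.

.. code-block:: c++

  #include <algorithm>
  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical stand-in for an insertion hint: text plus a byte offset.
  struct ToyInsertion {
    std::size_t Offset; // insert before this offset in the original line
    std::string Code;   // text to insert
  };

  std::string applyToyInsertions(std::string Line,
                                 std::vector<ToyInsertion> Hints) {
    // Apply from the highest offset down so that earlier offsets, which are
    // expressed against the original text, remain valid.
    std::sort(Hints.begin(), Hints.end(),
              [](const ToyInsertion &A, const ToyInsertion &B) {
                return A.Offset > B.Offset;
              });
    for (const ToyInsertion &H : Hints)
      Line.insert(H.Offset, H.Code);
    return Line;
  }

Under these assumptions, applying an insertion of ``(`` at offset 2 and of
``)`` at offset 10 to ``A<100 >> 2> *a;`` produces ``A<(100 >> 2)> *a;``, which
is the rewrite the rendered hint line suggests.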

All fix-it hints are described by the ``FixItHint`` class, instances of which
should be attached to the diagnostic using the ``<<`` operator in the same way
that highlighted source ranges and arguments are passed to the diagnostic.
Fix-it hints can be created with one of three constructors:

* ``FixItHint::CreateInsertion(Loc, Code)``

    Specifies that the given ``Code`` (a string) should be inserted before the
    source location ``Loc``.

* ``FixItHint::CreateRemoval(Range)``

    Specifies that the code in the given source ``Range`` should be removed.

* ``FixItHint::CreateReplacement(Range, Code)``

    Specifies that the code in the given source ``Range`` should be removed,
    and replaced with the given ``Code`` string.

.. _DiagnosticConsumer:

The ``DiagnosticConsumer`` Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Once code generates a diagnostic with all of the arguments and the rest of the
relevant information, Clang needs to know what to do with it. As previously
mentioned, the diagnostic machinery goes through some filtering to map a
severity onto a diagnostic level, then (assuming the diagnostic is not mapped
to "``Ignored``") it invokes an object that implements the
``DiagnosticConsumer`` interface with the information.

It is possible to implement this interface in many different ways. For
example, the normal Clang ``DiagnosticConsumer`` (named
``TextDiagnosticPrinter``) turns the arguments into strings (according to the
various formatting rules), prints out the file/line/column information and the
string, then prints out the line of code, the source ranges, and the caret.
However, this behavior isn't required.

Another implementation of the ``DiagnosticConsumer`` interface is the
``TextDiagnosticBuffer`` class, which is used when Clang is in ``-verify``
mode.
Instead of formatting and printing out the diagnostics, this
implementation just captures and remembers the diagnostics as they fly by.
Then ``-verify`` compares the list of produced diagnostics to the list of
expected ones. If they disagree, it prints out its own output. Full
documentation for the ``-verify`` mode can be found at
:ref:`verifying-diagnostics`.

There are many other possible implementations of this interface, and this is
why we prefer diagnostics to pass down rich structured information in
arguments. For example, an HTML output might want declaration names to be
linkified to where they come from in the source. Another example is that a GUI
might let you click on typedefs to expand them. This application would want to
pass significantly more information about types through to the GUI than a
simple flat string. The interface allows this to happen.

.. _internals-diag-translation:

Adding Translations to Clang
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Not possible yet! Diagnostic strings should be written in UTF-8; the client can
translate to the relevant code page if needed. Each translation completely
replaces the format string for the diagnostic.

.. _SourceLocation:
.. _SourceManager:

The ``SourceLocation`` and ``SourceManager`` classes
----------------------------------------------------

Strangely enough, the ``SourceLocation`` class represents a location within the
source code of the program. Important design points include:

#. ``sizeof(SourceLocation)`` must be extremely small, as these are embedded
   into many AST nodes and are passed around often. Currently it is 32 bits.
#. ``SourceLocation`` must be a simple value object that can be efficiently
   copied.
#. We should be able to represent a source location for any byte of any input
   file. This includes in the middle of tokens, in whitespace, in trigraphs,
   etc.
#. A ``SourceLocation`` must encode the current ``#include`` stack that was
   active when the location was processed. For example, if the location
   corresponds to a token, it should contain the set of ``#include``\ s active
   when the token was lexed. This allows us to print the ``#include`` stack
   for a diagnostic.
#. ``SourceLocation`` must be able to describe macro expansions, capturing both
   the ultimate instantiation point and the source of the original character
   data.

In practice, the ``SourceLocation`` works together with the ``SourceManager``
class to encode two pieces of information about a location: its spelling
location and its expansion location. For most tokens, these will be the
same. However, for a macro expansion (or tokens that came from a ``_Pragma``
directive) these will describe the location of the characters corresponding to
the token and the location where the token was used (i.e., the macro
expansion point or the location of the ``_Pragma`` itself).

The Clang front-end inherently depends on the location of a token being tracked
correctly. If it is ever incorrect, the front-end may get confused and die.
The reason for this is that the notion of the "spelling" of a ``Token`` in
Clang depends on being able to find the original input characters for the
token. This concept maps directly to the "spelling location" for the token.

``SourceRange`` and ``CharSourceRange``
---------------------------------------

.. mostly taken from https://discourse.llvm.org/t/code-ranges-of-tokens-ast-elements/16893/2

Clang represents most source ranges by [first, last], where "first" and "last"
each point to the beginning of their respective tokens. For example consider
the ``SourceRange`` of the following statement:

.. code-block:: text

  x = foo + bar;
  ^first    ^last

To map from this representation to a character-based representation, the "last"
location needs to be adjusted to point to (or past) the end of that token with
either ``Lexer::MeasureTokenLength()`` or ``Lexer::getLocForEndOfToken()``. For
the rare cases where character-level source range information is needed we use
the ``CharSourceRange`` class.

The Driver Library
==================

The Clang Driver and library are documented :doc:`here <DriverInternals>`.

Precompiled Headers
===================

Clang supports precompiled headers (:doc:`PCH <PCHInternals>`), which use a
serialized representation of Clang's internal data structures, encoded with the
`LLVM bitstream format <https://llvm.org/docs/BitCodeFormat.html>`_.

The Frontend Library
====================

The Frontend library contains functionality useful for building tools on top of
the Clang libraries, for example several methods for outputting diagnostics.

Compiler Invocation
-------------------

One of the classes provided by the Frontend library is ``CompilerInvocation``,
which holds information that describes the current invocation of the Clang
``-cc1`` frontend. The information typically comes from the command line
constructed by the Clang driver or from clients performing custom
initialization. The data structure is split into logical units used by
different parts of the compiler, for example ``PreprocessorOptions``,
``LanguageOptions`` or ``CodeGenOptions``.

Command Line Interface
----------------------

The command line interface of the Clang ``-cc1`` frontend is defined alongside
the driver options in ``clang/Driver/Options.td``. The information making up an
option definition includes its prefix and name (for example ``-std=``), form and
position of the option value, help text, aliases and more.
Each option may
belong to a certain group and can be marked with zero or more flags. Options
accepted by the ``-cc1`` frontend are marked with the ``CC1Option`` flag.

Command Line Parsing
--------------------

Option definitions are processed by the ``-gen-opt-parser-defs`` tablegen
backend during early stages of the build. Options are then used for querying an
instance of ``llvm::opt::ArgList``, a wrapper around the command line arguments.
This is done in the Clang driver to construct individual jobs based on the
driver arguments and also in the ``CompilerInvocation::CreateFromArgs`` function
that parses the ``-cc1`` frontend arguments.

Command Line Generation
-----------------------

Any valid ``CompilerInvocation`` created from a ``-cc1`` command line can
also be serialized back into a semantically equivalent command line in a
deterministic manner. This enables features such as implicitly discovered,
explicitly built modules.

..
  TODO: Create and link corresponding section in Modules.rst.

Adding new Command Line Option
------------------------------

When adding a new command line option, the first place of interest is the header
file declaring the corresponding options class (e.g. ``CodeGenOptions.h`` for a
command line option that affects code generation). Create a new member
variable for the option value:

.. code-block:: diff

    class CodeGenOptions : public CodeGenOptionsBase {

  +   /// List of dynamic shared object files to be loaded as pass plugins.
  +   std::vector<std::string> PassPlugins;

    }

Next, declare the command line interface of the option in the tablegen file
``clang/include/clang/Driver/Options.td``. This is done by instantiating the
``Option`` class (defined in ``llvm/include/llvm/Option/OptParser.td``).
The
instance is typically created through one of the helper classes that encode the
acceptable ways to specify the option value on the command line:

* ``Flag`` - the option does not accept any value,
* ``Joined`` - the value must immediately follow the option name within the same
  argument,
* ``Separate`` - the value must follow the option name in the next command line
  argument,
* ``JoinedOrSeparate`` - the value can be specified either as ``Joined`` or
  ``Separate``,
* ``CommaJoined`` - the values are comma-separated and must immediately follow
  the option name within the same argument (see ``Wl,`` for an example).

The helper classes take a list of acceptable prefixes of the option (e.g.
``"-"``, ``"--"`` or ``"/"``) and the option name:

.. code-block:: diff

    // Options.td

  + def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">;

Then, specify additional attributes via mix-ins:

* ``HelpText`` holds the text that will be printed beside the option name when
  the user requests help (e.g. via ``clang --help``).
* ``Group`` specifies the "category" of options this option belongs to. This is
  used by various tools to categorize and sometimes filter options.
* ``Flags`` may contain "tags" associated with the option. These may affect how
  the option is rendered, or if it's hidden in some contexts.
* ``Visibility`` should be used to specify the drivers in which a particular
  option would be available. This attribute affects the ``--help`` output of
  each tool.
* ``Alias`` denotes that the option is an alias of another option. This may be
  combined with ``AliasArgs`` that holds the implied value.

.. code-block:: diff

    // Options.td

    def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">,
  +   Group<f_Group>, Visibility<[ClangOption, CC1Option]>,
  +   HelpText<"Load pass plugin from a dynamic shared object file.">;

New options are recognized by the ``clang`` driver mode if ``Visibility`` is
not specified or contains ``ClangOption``. Options intended for ``clang -cc1``
must be explicitly marked with ``CC1Option``. Options that specify
``CC1Option`` but not ``ClangOption`` are only accessible via ``-cc1``.
The same applies to other driver modes, such as ``clang-cl`` or ``flang``.

Next, parse (or manufacture) the command line arguments in the Clang driver and
use them to construct the ``-cc1`` job:

.. code-block:: diff

    void Clang::ConstructJob(const ArgList &Args /*...*/) const {
      ArgStringList CmdArgs;
      // ...

  +   for (const Arg *A : Args.filtered(OPT_fpass_plugin_EQ)) {
  +     CmdArgs.push_back(Args.MakeArgString(Twine("-fpass-plugin=") + A->getValue()));
  +     A->claim();
  +   }
    }

The last step is implementing the ``-cc1`` command line argument
parsing/generation that initializes/serializes the option class (in our case
``CodeGenOptions``) stored within ``CompilerInvocation``. This can be done
automatically by using the marshalling annotations on the option definition:

.. code-block:: diff

    // Options.td

    def fpass_plugin_EQ : Joined<["-"], "fpass-plugin=">,
      Group<f_Group>, Visibility<[ClangOption, CC1Option]>,
      HelpText<"Load pass plugin from a dynamic shared object file.">,
  +   MarshallingInfoStringVector<CodeGenOpts<"PassPlugins">>;

The inner workings of the system are introduced in the :ref:`marshalling
infrastructure <OptionMarshalling>` section and the available annotations are
listed :ref:`here <OptionMarshallingAnnotations>`.
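
To make the parsing/generation round trip concrete, here is a minimal
standalone sketch of what a string-vector option such as ``-fpass-plugin=``
amounts to. It deliberately avoids the LLVM option library:
``CodeGenOptionsSketch``, ``parseArgs`` and ``generateArgs`` are hypothetical
names invented for this illustration and are not part of Clang.

.. code-block:: c++

  #include <string>
  #include <vector>

  // Hypothetical stand-in for the real options class; not Clang's definition.
  struct CodeGenOptionsSketch {
    std::vector<std::string> PassPlugins;
  };

  // Spelling of the joined option used throughout this section.
  static const std::string PassPluginPrefix = "-fpass-plugin=";

  // Parsing: append the value of every occurrence of the option, in order.
  CodeGenOptionsSketch parseArgs(const std::vector<std::string> &Args) {
    CodeGenOptionsSketch Opts;
    for (const std::string &Arg : Args)
      if (Arg.compare(0, PassPluginPrefix.size(), PassPluginPrefix) == 0)
        Opts.PassPlugins.push_back(Arg.substr(PassPluginPrefix.size()));
    return Opts;
  }

  // Generation: emit one argument per stored value, preserving the order.
  std::vector<std::string> generateArgs(const CodeGenOptionsSketch &Opts) {
    std::vector<std::string> Args;
    for (const std::string &Plugin : Opts.PassPlugins)
      Args.push_back(PassPluginPrefix + Plugin);
    return Args;
  }

The property worth noticing is that parsing the generated arguments again
yields the same option values, which is what makes a deterministic,
semantically equivalent command line possible.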

In case the marshalling infrastructure does not support the desired semantics,
consider simplifying the semantics to fit the existing model. This makes the
command line more uniform and reduces the amount of custom, manually written
code. Remember that the ``-cc1`` command line interface is intended only for
Clang developers, meaning it does not need to mirror the driver interface,
maintain backward compatibility or be compatible with GCC.

If the option semantics cannot be encoded via marshalling annotations, you can
resort to parsing/serializing the command line arguments manually:

.. code-block:: diff

    // CompilerInvocation.cpp

    static bool ParseCodeGenArgs(CodeGenOptions &Opts, ArgList &Args /*...*/) {
      // ...

  +   Opts.PassPlugins = Args.getAllArgValues(OPT_fpass_plugin_EQ);
    }

    static void GenerateCodeGenArgs(const CodeGenOptions &Opts,
                                    SmallVectorImpl<const char *> &Args,
                                    CompilerInvocation::StringAllocator SA /*...*/) {
      // ...

  +   for (const std::string &PassPlugin : Opts.PassPlugins)
  +     GenerateArg(Args, OPT_fpass_plugin_EQ, PassPlugin, SA);
    }

Finally, you can specify the argument on the command line:
``clang -fpass-plugin=a -fpass-plugin=b`` and use the new member variable as
desired.

.. code-block:: diff

    void EmitAssemblyHelper::EmitAssemblyWithNewPassManager(/*...*/) {
      // ...
  +   for (auto &PluginFN : CodeGenOpts.PassPlugins)
  +     if (auto PassPlugin = PassPlugin::Load(PluginFN))
  +       PassPlugin->registerPassBuilderCallbacks(PB);
    }

.. _OptionMarshalling:

Option Marshalling Infrastructure
---------------------------------

The option marshalling infrastructure automates the parsing of the Clang
``-cc1`` frontend command line arguments into ``CompilerInvocation`` and their
generation from ``CompilerInvocation``.
The system replaces lots of repetitive
C++ code with simple, declarative tablegen annotations and it's being used for
the majority of the ``-cc1`` command line interface. This section provides an
overview of the system.

**Note:** The marshalling infrastructure is not intended for driver-only
options. Only options of the ``-cc1`` frontend need to be marshalled to/from a
``CompilerInvocation`` instance.

To read and modify contents of ``CompilerInvocation``, the marshalling system
uses key paths, which are declared in two steps. First, a tablegen definition
for the ``CompilerInvocation`` member is created by inheriting from
``KeyPathAndMacro``:

.. code-block:: text

  // Options.td

  class LangOpts<string field> : KeyPathAndMacro<"LangOpts->", field, "LANG_"> {}
  //                   CompilerInvocation member  ^^^^^^^^^^
  //                                    OPTION_WITH_MARSHALLING prefix ^^^^^

The first argument to the parent class is the beginning of the key path that
references the ``CompilerInvocation`` member. This argument ends with ``->`` if
the member is a pointer type or with ``.`` if it's a value type. The child class
takes a single parameter ``field`` that is forwarded as the second argument to
the base class. The child class can then be used like so:
``LangOpts<"IgnoreExceptions">``, constructing a key path to the field
``LangOpts->IgnoreExceptions``. The third argument passed to the parent class is
a string that the tablegen backend uses as a prefix to the
``OPTION_WITH_MARSHALLING`` macro. Using the key path as a mix-in on an
``Option`` instance instructs the backend to generate the following code:

.. code-block:: c++

  // Options.inc

  #ifdef LANG_OPTION_WITH_MARSHALLING
  LANG_OPTION_WITH_MARSHALLING([...], LangOpts->IgnoreExceptions, [...])
  #endif // LANG_OPTION_WITH_MARSHALLING

Such a definition can be used in the functions for parsing and generating the
command line:

.. code-block:: c++

  // clang/lib/Frontend/CompilerInvocation.cpp

  bool CompilerInvocation::ParseLangArgs(LangOptions *LangOpts, ArgList &Args,
                                         DiagnosticsEngine &Diags) {
    bool Success = true;

  #define LANG_OPTION_WITH_MARSHALLING( \
      PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM, \
      HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH, \
      DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER, \
      MERGER, EXTRACTOR, TABLE_INDEX) \
    PARSE_OPTION_WITH_MARSHALLING(Args, Diags, Success, ID, FLAGS, PARAM, \
                                  SHOULD_PARSE, KEYPATH, DEFAULT_VALUE, \
                                  IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, \
                                  MERGER, TABLE_INDEX)
  #include "clang/Driver/Options.inc"
  #undef LANG_OPTION_WITH_MARSHALLING

    // ...

    return Success;
  }

  void CompilerInvocation::GenerateLangArgs(LangOptions *LangOpts,
                                            SmallVectorImpl<const char *> &Args,
                                            StringAllocator SA) {
  #define LANG_OPTION_WITH_MARSHALLING( \
      PREFIX_TYPE, NAME, ID, KIND, GROUP, ALIAS, ALIASARGS, FLAGS, PARAM, \
      HELPTEXT, METAVAR, VALUES, SPELLING, SHOULD_PARSE, ALWAYS_EMIT, KEYPATH, \
      DEFAULT_VALUE, IMPLIED_CHECK, IMPLIED_VALUE, NORMALIZER, DENORMALIZER, \
      MERGER, EXTRACTOR, TABLE_INDEX) \
    GENERATE_OPTION_WITH_MARSHALLING( \
        Args, SA, KIND, FLAGS, SPELLING, ALWAYS_EMIT, KEYPATH, DEFAULT_VALUE, \
        IMPLIED_CHECK, IMPLIED_VALUE, DENORMALIZER, EXTRACTOR, TABLE_INDEX)
  #include "clang/Driver/Options.inc"
  #undef LANG_OPTION_WITH_MARSHALLING

    // ...
  }

The ``PARSE_OPTION_WITH_MARSHALLING`` and ``GENERATE_OPTION_WITH_MARSHALLING``
macros are defined in ``CompilerInvocation.cpp`` and they implement the generic
algorithm for parsing and generating command line arguments.

.. _OptionMarshallingAnnotations:

Option Marshalling Annotations
------------------------------

How does the tablegen backend know what to put in place of ``[...]`` in the
generated ``Options.inc``? This is specified by the ``Marshalling`` utilities
described below. All of them take a key path argument and possibly other
information required for parsing or generating the command line argument.

**Note:** The marshalling infrastructure is not intended for driver-only
options. Only options of the ``-cc1`` frontend need to be marshalled to/from a
``CompilerInvocation`` instance.

**Positive Flag**

The key path defaults to ``false`` and is set to ``true`` when the flag is
present on the command line.

.. code-block:: text

  def fignore_exceptions : Flag<["-"], "fignore-exceptions">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoFlag<LangOpts<"IgnoreExceptions">>;

**Negative Flag**

The key path defaults to ``true`` and is set to ``false`` when the flag is
present on the command line.

.. code-block:: text

  def fno_verbose_asm : Flag<["-"], "fno-verbose-asm">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoNegativeFlag<CodeGenOpts<"AsmVerbose">>;

**Negative and Positive Flag**

The key path defaults to the specified value (``false``, ``true`` or some
boolean value that's statically unknown in the tablegen file). Then, the key
path is set to the value associated with the flag that appears last on the
command line.

.. code-block:: text

  defm legacy_pass_manager : BoolOption<"f", "legacy-pass-manager",
    CodeGenOpts<"LegacyPassManager">, DefaultFalse,
    PosFlag<SetTrue, [], [], "Use the legacy pass manager in LLVM">,
    NegFlag<SetFalse, [], [], "Use the new pass manager in LLVM">,
    BothFlags<[], [ClangOption, CC1Option]>>;

With most such pairs of flags, the ``-cc1`` frontend accepts only the flag that
changes the default key path value. The Clang driver is responsible for
accepting both and either forwarding the changing flag or discarding the flag
that would just set the key path to its default.

The first argument to ``BoolOption`` is a prefix that is used to construct the
full names of both flags. The positive flag would then be named
``flegacy-pass-manager`` and the negative ``fno-legacy-pass-manager``.
``BoolOption`` also implies the ``-`` prefix for both flags. It's also possible
to use ``BoolFOption`` that implies the ``"f"`` prefix and ``Group<f_Group>``.
The ``PosFlag`` and ``NegFlag`` classes hold the associated boolean value,
arrays of elements passed to the ``Flags`` and ``Visibility`` classes and the
help text. The optional ``BothFlags`` class holds arrays of ``Flags`` and
``Visibility`` elements that are common for both the positive and negative flag
and their common help text suffix.

**String**

The key path defaults to the specified string, or an empty one, if omitted. When
the option appears on the command line, the argument value is simply copied.

.. code-block:: text

  def isysroot : JoinedOrSeparate<["-"], "isysroot">,
    Visibility<[ClangOption, CC1Option, FlangOption]>,
    MarshallingInfoString<HeaderSearchOpts<"Sysroot">, [{"/"}]>;

**List of Strings**

The key path defaults to an empty ``std::vector<std::string>``. Values specified
with each appearance of the option on the command line are appended to the
vector.

.. code-block:: text

  def frewrite_map_file : Separate<["-"], "frewrite-map-file">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoStringVector<CodeGenOpts<"RewriteMapFiles">>;

**Integer**

The key path defaults to the specified integer value, or ``0`` if omitted. When
the option appears on the command line, its value gets parsed by ``llvm::APInt``
and the result is assigned to the key path on success.

.. code-block:: text

  def mstack_probe_size : Joined<["-"], "mstack-probe-size=">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoInt<CodeGenOpts<"StackProbeSize">, "4096">;

**Enumeration**

The key path defaults to the value specified in ``MarshallingInfoEnum`` prefixed
by the contents of ``NormalizedValuesScope`` and ``::``. This ensures a correct
reference to an enum case is formed even if the enum resides in a different
namespace or is an enum class. If the value present on the command line does not
match any of the comma-separated values from ``Values``, an error diagnostic is
issued. Otherwise, the corresponding element from ``NormalizedValues`` at the
same index is assigned to the key path (also correctly scoped). The number of
comma-separated string values and elements of the array within
``NormalizedValues`` must match.

.. code-block:: text

  def mthread_model : Separate<["-"], "mthread-model">,
    Visibility<[ClangOption, CC1Option]>,
    Values<"posix,single">, NormalizedValues<["POSIX", "Single"]>,
    NormalizedValuesScope<"LangOptions::ThreadModelKind">,
    MarshallingInfoEnum<LangOpts<"ThreadModel">, "POSIX">;

..
  Intentionally omitting MarshallingInfoBitfieldFlag. It's adding some
  complexity to the marshalling infrastructure and might be removed.

It is also possible to define relationships between options.
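
Before turning to those relationships, the index-based mapping used by the
enumeration annotation can be sketched in standalone C++ (C++17). This is only
an illustration of the idea: the ``ThreadModelKind`` enum below is a local
stand-in, and ``normalize`` and ``denormalize`` are hypothetical helpers, not
the functions the marshalling infrastructure actually generates.

.. code-block:: c++

  #include <array>
  #include <cstddef>
  #include <optional>
  #include <string_view>

  // Local stand-in for the enum named in NormalizedValuesScope.
  enum class ThreadModelKind { POSIX, Single };

  // Values and NormalizedValues must have the same number of elements.
  constexpr std::array<std::string_view, 2> Values = {"posix", "single"};
  constexpr std::array<ThreadModelKind, 2> NormalizedValues = {
      ThreadModelKind::POSIX, ThreadModelKind::Single};

  // Parsing: return the enum case at the index of the matching spelling, or
  // std::nullopt where a real parser would issue an error diagnostic.
  std::optional<ThreadModelKind> normalize(std::string_view Arg) {
    for (std::size_t I = 0; I < Values.size(); ++I)
      if (Values[I] == Arg)
        return NormalizedValues[I];
    return std::nullopt;
  }

  // Generation: the inverse lookup, from enum case back to its spelling.
  std::string_view denormalize(ThreadModelKind Kind) {
    for (std::size_t I = 0; I < NormalizedValues.size(); ++I)
      if (NormalizedValues[I] == Kind)
        return Values[I];
    return {};
  }

Keeping the two arrays index-aligned is what lets a single declaration drive
both directions, which is why their lengths are required to match.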

**Implication**

The key path defaults to the default value from the primary ``Marshalling``
annotation. Then, if any of the elements of ``ImpliedByAnyOf`` evaluate to true,
the key path value is changed to the specified value or ``true`` if missing.
Finally, the command line is parsed according to the primary annotation.

.. code-block:: text

  def fms_extensions : Flag<["-"], "fms-extensions">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoFlag<LangOpts<"MicrosoftExt">>,
    ImpliedByAnyOf<[fms_compatibility.KeyPath], "true">;

**Condition**

The option is parsed only if the expression in ``ShouldParseIf`` evaluates to
true.

.. code-block:: text

  def fopenmp_enable_irbuilder : Flag<["-"], "fopenmp-enable-irbuilder">,
    Visibility<[ClangOption, CC1Option]>,
    MarshallingInfoFlag<LangOpts<"OpenMPIRBuilder">>,
    ShouldParseIf<fopenmp.KeyPath>;

The Lexer and Preprocessor Library
==================================

The Lexer library contains several tightly-connected classes that are involved
with the nasty process of lexing and preprocessing C source code. The main
interface to this library for outside clients is the large ``Preprocessor``
class. It contains the various pieces of state that are required to coherently
read tokens out of a translation unit.

The core interface to the ``Preprocessor`` object (once it is set up) is the
``Preprocessor::Lex`` method, which returns the next :ref:`Token <Token>` from
the preprocessor stream. There are two types of token providers that the
preprocessor is capable of reading from: a buffer lexer (provided by the
:ref:`Lexer <Lexer>` class) and a buffered token stream (provided by the
:ref:`TokenLexer <TokenLexer>` class).

.. _Token:

The Token class
---------------

The ``Token`` class is used to represent a single lexed token. Tokens are
intended to be used by the lexer/preprocessor and parser libraries, but are not
intended to live beyond them (for example, they should not live in the ASTs).

Tokens most often live on the stack (or some other location that is efficient
to access) as the parser is running, but occasionally do get buffered up. For
example, macro definitions are stored as a series of tokens, and the C++
front-end periodically needs to buffer tokens up for tentative parsing and
various pieces of look-ahead. As such, the size of a ``Token`` matters. On a
32-bit system, ``sizeof(Token)`` is currently 16 bytes.

Tokens occur in two forms: :ref:`annotation tokens <AnnotationToken>` and
normal tokens. Normal tokens are those returned by the lexer; annotation
tokens represent semantic information and are produced by the parser, replacing
normal tokens in the token stream. Normal tokens contain the following
information:

* **A SourceLocation** --- This indicates the location of the start of the
  token.

* **A length** --- This stores the length of the token as stored in the
  ``SourceBuffer``. For tokens that include them, this length includes
  trigraphs and escaped newlines which are ignored by later phases of the
  compiler. By pointing into the original source buffer, it is always possible
  to get the original spelling of a token completely accurately.

* **IdentifierInfo** --- If a token takes the form of an identifier, and if
  identifier lookup was enabled when the token was lexed (e.g., the lexer was
  not reading in "raw" mode) this contains a pointer to the unique hash value
  for the identifier. Because the lookup happens before keyword
  identification, this field is set even for language keywords like "``for``".

* **TokenKind** --- This indicates the kind of token as classified by the
  lexer. This includes things like ``tok::starequal`` (for the "``*=``"
  operator), ``tok::ampamp`` for the "``&&``" token, and keyword values (e.g.,
  ``tok::kw_for``) for identifiers that correspond to keywords. Note that
  some tokens can be spelled multiple ways. For example, C++ supports
  "operator keywords", where things like "``and``" are treated exactly like the
  "``&&``" operator. In these cases, the kind value is set to ``tok::ampamp``,
  which is good for the parser, which doesn't have to consider both forms. For
  something that cares about which form is used (e.g., the preprocessor
  "stringize" operator) the spelling indicates the original form.

* **Flags** --- There are currently four flags tracked by the
  lexer/preprocessor system on a per-token basis:

  #. **StartOfLine** --- This was the first token that occurred on its input
     source line.
  #. **LeadingSpace** --- There was a space character either immediately before
     the token or transitively before the token as it was expanded through a
     macro. The definition of this flag is very closely defined by the
     stringizing requirements of the preprocessor.
  #. **DisableExpand** --- This flag is used internally to the preprocessor to
     represent identifier tokens which have macro expansion disabled. This
     prevents them from being considered as candidates for macro expansion ever
     in the future.
  #. **NeedsCleaning** --- This flag is set if the original spelling for the
     token includes a trigraph or escaped newline. Since this is uncommon,
     many pieces of code can fast-path on tokens that did not need cleaning.

One interesting (and somewhat unusual) aspect of normal tokens is that they
don't contain any semantic information about the lexed value.
For example, if
the token was a pp-number token, we do not represent the value of the number
that was lexed (this is left for later pieces of code to decide).
Additionally, the lexer library has no notion of typedef names vs variable
names: both are returned as identifiers, and the parser is left to decide
whether a specific identifier is a typedef or a variable (tracking this
requires scope information among other things). The parser can do this
translation by replacing tokens returned by the preprocessor with "Annotation
Tokens".

.. _AnnotationToken:

Annotation Tokens
-----------------

Annotation tokens are tokens that are synthesized by the parser and injected
into the preprocessor's token stream (replacing existing tokens) to record
semantic information found by the parser. For example, if "``foo``" is found
to be a typedef, the "``foo``" ``tok::identifier`` token is replaced with a
``tok::annot_typename``. This is useful for a couple of reasons: 1) this makes
it easy to handle qualified type names (e.g., "``foo::bar::baz<42>::t``") in
C++ as a single "token" in the parser. 2) if the parser backtracks, the
reparse does not need to redo semantic analysis to determine whether a token
sequence is a variable, type, template, etc.

Annotation tokens are created by the parser and reinjected into the parser's
token stream (when backtracking is enabled). Because they can only exist in
tokens that the preprocessor-proper is done with, they don't need to keep
around flags like "start of line" that the preprocessor uses to do its job.
Additionally, an annotation token may "cover" a sequence of preprocessor tokens
(e.g., "``a::b::c``" is five preprocessor tokens).
As such, the valid fields
of an annotation token are different than the fields for a normal token (but
they are multiplexed into the normal ``Token`` fields):

* **SourceLocation "Location"** --- The ``SourceLocation`` for the annotation
  token indicates the first token replaced by the annotation token. In the
  example above, it would be the location of the "``a``" identifier.
* **SourceLocation "AnnotationEndLoc"** --- This holds the location of the last
  token replaced with the annotation token. In the example above, it would be
  the location of the "``c``" identifier.
* **void* "AnnotationValue"** --- This contains an opaque object that the
  parser gets from ``Sema``. The parser merely preserves the information for
  ``Sema`` to later interpret based on the annotation token kind.
* **TokenKind "Kind"** --- This indicates the kind of Annotation token this is.
  See below for the different valid kinds.

Annotation tokens currently come in three kinds:

#. **tok::annot_typename**: This annotation token represents a resolved
   typename token that is potentially qualified. The ``AnnotationValue`` field
   contains the ``QualType`` returned by ``Sema::getTypeName()``, possibly with
   source location information attached.
#. **tok::annot_cxxscope**: This annotation token represents a C++ scope
   specifier, such as "``A::B::``". This corresponds to the grammar
   productions "*::*" and "*:: [opt] nested-name-specifier*". The
   ``AnnotationValue`` pointer is a ``NestedNameSpecifier *`` returned by the
   ``Sema::ActOnCXXGlobalScopeSpecifier`` and
   ``Sema::ActOnCXXNestedNameSpecifier`` callbacks.
#. **tok::annot_template_id**: This annotation token represents a C++
   template-id such as "``foo<int, 4>``", where "``foo``" is the name of a
   template. The ``AnnotationValue`` pointer is a pointer to a ``malloc``'d
   ``TemplateIdAnnotation`` object. Depending on the context, a parsed
   template-id that names a type might become a typename annotation token (if
   all we care about is the named type, e.g., because it occurs in a type
   specifier) or might remain a template-id token (if we want to retain more
   source location information or produce a new type, e.g., in a declaration of
   a class template specialization). Template-id annotation tokens that refer
   to a type can be "upgraded" to typename annotation tokens by the parser.

As mentioned above, annotation tokens are not returned by the preprocessor;
they are formed on demand by the parser. This means that the parser has to be
aware of cases where an annotation could occur and form it where appropriate.
This is somewhat similar to how the parser handles Translation Phase 6 of C99:
String Concatenation (see C99 5.1.1.2). In the case of string concatenation,
the preprocessor just returns distinct ``tok::string_literal`` and
``tok::wide_string_literal`` tokens and the parser eats a sequence of them
wherever the grammar indicates that a string literal can occur.

In order to do this, whenever the parser expects a ``tok::identifier`` or
``tok::coloncolon``, it should call the ``TryAnnotateTypeOrScopeToken`` or
``TryAnnotateCXXScopeToken`` methods to form the annotation token. These
methods will maximally form the specified annotation tokens and replace the
current token with them, if applicable. If the current token is not valid for
an annotation token, it will remain an identifier or "``::``" token.

.. _Lexer:

The ``Lexer`` class
-------------------

The ``Lexer`` class provides the mechanics of lexing tokens out of a source
buffer and deciding what they mean.
The ``Lexer`` is complicated by the fact
that it operates on raw buffers that have not had spelling eliminated (this is
a necessity to get decent performance), but this is countered with careful
coding as well as standard performance techniques (for example, the comment
handling code is vectorized on X86 and PowerPC hosts).

The lexer has a couple of interesting modal features:

* The lexer can operate in "raw" mode. This mode has several features that
  make it possible to quickly lex the file (e.g., it stops identifier lookup,
  doesn't specially handle preprocessor tokens, handles EOF differently, etc).
  This mode is used for lexing within an "``#if 0``" block, for example.
* The lexer can capture and return comments as tokens. This is required to
  support the ``-C`` preprocessor mode, which passes comments through, and is
  used by the diagnostic checker to identify expect-error annotations.
* The lexer can be in ``ParsingFilename`` mode, which happens when
  preprocessing after reading a ``#include`` directive. This mode changes the
  parsing of "``<``" to return an "angled string" instead of a bunch of tokens
  for each thing within the filename.
* When parsing a preprocessor directive (after "``#``") the
  ``ParsingPreprocessorDirective`` mode is entered. This changes the parser to
  return EOD at a newline.
* The ``Lexer`` uses a ``LangOptions`` object to know whether trigraphs are
  enabled, whether C++ or ObjC keywords are recognized, etc.

In addition to these modes, the lexer keeps track of a couple of other features
that are local to a lexed buffer, which change as the buffer is lexed:

* The ``Lexer`` uses ``BufferPtr`` to keep track of the current character being
  lexed.
* The ``Lexer`` uses ``IsAtStartOfLine`` to keep track of whether the next
  lexed token will start with its "start of line" bit set.
* The ``Lexer`` keeps track of the current "``#if``" directives that are active
  (which can be nested).
* The ``Lexer`` keeps track of a :ref:`MultipleIncludeOpt
  <MultipleIncludeOpt>` object, which is used to detect whether the buffer uses
  the standard "``#ifndef XX`` / ``#define XX``" idiom to prevent multiple
  inclusion. If a buffer does, subsequent includes can be ignored if the
  "``XX``" macro is defined.

.. _TokenLexer:

The ``TokenLexer`` class
------------------------

The ``TokenLexer`` class is a token provider that returns tokens from a list of
tokens that came from somewhere else. It is typically used for two things: 1)
returning tokens from a macro definition as it is being expanded 2) returning
tokens from an arbitrary buffer of tokens. The latter use is used by
``_Pragma`` and will most likely be used to handle unbounded look-ahead for the
C++ parser.

.. _MultipleIncludeOpt:

The ``MultipleIncludeOpt`` class
--------------------------------

The ``MultipleIncludeOpt`` class implements a really simple little state
machine that is used to detect the standard "``#ifndef XX`` / ``#define XX``"
idiom that people typically use to prevent multiple inclusion of headers. If a
buffer uses this idiom and is subsequently ``#include``'d, the preprocessor can
simply check to see whether the guarding condition is defined or not. If so,
the preprocessor can completely ignore the include of the header.

.. _Parser:

The Parser Library
==================

This library contains a recursive-descent parser that polls tokens from the
preprocessor and notifies a client of the parsing progress.

Historically, the parser used to talk to an abstract ``Action`` interface that
had virtual methods for parse events, for example ``ActOnBinOp()``.
When Clang
grew C++ support, the parser stopped supporting general ``Action`` clients --
it now always talks to the :ref:`Sema library <Sema>`. However, the Parser
still accesses AST objects only through opaque types like ``ExprResult`` and
``StmtResult``. Only :ref:`Sema <Sema>` looks at the AST node contents of these
wrappers.

.. _AST:

The AST Library
===============

.. _ASTPhilosophy:

Design philosophy
-----------------

Immutability
^^^^^^^^^^^^

Clang AST nodes (types, declarations, statements, expressions, and so on) are
generally designed to be immutable once created. This provides a number of key
benefits:

  * Canonicalization of the "meaning" of nodes is possible as soon as the nodes
    are created, and is not invalidated by later addition of more information.
    For example, we :ref:`canonicalize types <CanonicalType>`, and use a
    canonicalized representation of expressions when determining whether two
    function template declarations involving dependent expressions declare the
    same entity.
  * AST nodes can be reused when they have the same meaning. For example, we
    reuse ``Type`` nodes when representing the same type (but maintain separate
    ``TypeLoc``\s for each instance where a type is written), and we reuse
    non-dependent ``Stmt`` and ``Expr`` nodes across instantiations of a
    template.
  * Serialization and deserialization of the AST to/from AST files is simpler:
    we do not need to track modifications made to AST nodes imported from AST
    files and serialize separate "update records".

There are unfortunately exceptions to this general approach, such as:

  * The first declaration of a redeclarable entity maintains a pointer to the
    most recent declaration of that entity, which naturally needs to change as
    more declarations are parsed.
  * Name lookup tables in declaration contexts change after the namespace
    declaration is formed.
  * We attempt to maintain only a single declaration for an instantiation of a
    template, rather than having distinct declarations for an instantiation of
    the declaration versus the definition, so template instantiation often
    updates parts of existing declarations.
  * Some parts of declarations are required to be instantiated separately (this
    includes default arguments and exception specifications), and such
    instantiations update the existing declaration.

These cases tend to be fragile; mutable AST state should be avoided where
possible.

As a consequence of this design principle, we typically do not provide setters
for AST state. (Some are provided for short-term modifications intended to be
used immediately after an AST node is created and before it's "published" as
part of the complete AST, or where language semantics require after-the-fact
updates.)

Faithfulness
^^^^^^^^^^^^

The AST intends to provide a representation of the program that is faithful to
the original source. We intend for it to be possible to write refactoring tools
using only information stored in, or easily reconstructible from, the Clang AST.
This means that the AST representation should either not desugar source-level
constructs to simpler forms, or -- where made necessary by language semantics
or a clear engineering tradeoff -- should desugar minimally and wrap the result
in a construct representing the original source form.

For example, ``CXXForRangeStmt`` directly represents the syntactic form of a
range-based for statement, but also holds a semantic representation of the
range declaration and iterator declarations. It does not contain a
fully-desugared ``ForStmt``, however.
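
As a rough illustration of what "a semantic representation of the range
declaration and iterator declarations" refers to, the sketch below places a
range-based for statement next to the equivalent form that the C++ standard
gives for it. This is ordinary C++ rather than Clang code: the function names
and the variables ``range``, ``begin``, and ``end`` are invented for this
example and stand in for the unnamed variables an implementation introduces.

.. code-block:: c++

  #include <vector>

  // Source form: the loop as the programmer wrote it.
  int sumRangeFor(const std::vector<int> &values) {
    int total = 0;
    for (int v : values)
      total += v;
    return total;
  }

  // Equivalent form, following the rewrite the C++ standard specifies for a
  // range-based for statement. The names below are illustrative only.
  int sumEquivalentForm(const std::vector<int> &values) {
    int total = 0;
    {
      auto &&range = values;       // the range declaration
      auto begin = range.begin();  // the "begin" iterator declaration
      auto end = range.end();      // the "end" iterator declaration
      for (; begin != end; ++begin) {
        int v = *begin;            // the loop variable, set on each iteration
        total += v;
      }
    }
    return total;
  }

Both functions compute the same result. The point of the comparison is that
the first form is what a refactoring tool needs to see, while the declarations
in the second form are what semantic analysis needs; keeping both, without
collapsing the loop into a plain ``ForStmt``, is the tradeoff described above.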

Some AST nodes (for example, ``ParenExpr``) represent only syntax, and others
(for example, ``ImplicitCastExpr``) represent only semantics, but most nodes
will represent a combination of syntax and associated semantics. Inheritance
is typically used when representing different (but related) syntaxes for nodes
with the same or similar semantics.

.. _Type:

The ``Type`` class and its subclasses
-------------------------------------

The ``Type`` class and its subclasses are an important part of the AST.
Types are accessed through the ``ASTContext`` class, which implicitly creates
and uniques them as they are needed. Types have a couple of non-obvious
features: 1) they do not capture type qualifiers like ``const`` or ``volatile``
(see :ref:`QualType <QualType>`), and 2) they implicitly capture typedef
information. Once created, types are immutable (unlike decls).

Typedefs in C make semantic analysis a bit more complex than it would be without
them. The issue is that we want to capture typedef information and represent it
in the AST perfectly, but the semantics of operations need to "see through"
typedefs. For example, consider this code:

.. code-block:: c++

  void func() {
    typedef int foo;
    foo X, *Y;
    typedef foo *bar;
    bar Z;
    *X;   // error
    **Y;  // error
    **Z;  // error
  }

The code above is illegal, and thus we expect there to be diagnostics emitted
on the annotated lines. In this example, we expect to get:

.. code-block:: text

  test.c:6:1: error: indirection requires pointer operand ('foo' invalid)
    *X; // error
    ^~
  test.c:7:1: error: indirection requires pointer operand ('foo' invalid)
    **Y; // error
    ^~~
  test.c:8:1: error: indirection requires pointer operand ('foo' invalid)
    **Z; // error
    ^~~

While this example is somewhat silly, it illustrates the point: we want to
retain typedef information where possible, so that we can emit errors about
"``std::string``" instead of "``std::basic_string<char, std:...``". Doing this
requires properly keeping typedef information (for example, the type of ``X``
is "``foo``", not "``int``"), and requires properly propagating it through the
various operators (for example, the type of ``*Y`` is "``foo``", not
"``int``"). In order to retain this information, the type of these expressions
is an instance of the ``TypedefType`` class, which indicates that the type of
these expressions is a typedef for "``foo``".

Representing types like this is great for diagnostics, because the
user-specified type is always immediately available. There are two problems
with this: first, various semantic checks need to make judgements about the
*actual structure* of a type, ignoring typedefs. Second, we need an efficient
way to query whether two types are structurally identical to each other,
ignoring typedefs. The solution to both of these problems is the idea of
canonical types.

.. _CanonicalType:

Canonical Types
^^^^^^^^^^^^^^^

Every instance of the ``Type`` class contains a canonical type pointer. For
simple types with no typedefs involved (e.g., "``int``", "``int*``",
"``int**``"), the type just points to itself.
For types that have a typedef
somewhere in their structure (e.g., "``foo``", "``foo*``", "``foo**``",
"``bar``"), the canonical type pointer points to their structurally equivalent
type without any typedefs (e.g., "``int``", "``int*``", "``int**``", and
"``int*``" respectively).

This design provides a constant time operation (dereferencing the canonical type
pointer) that gives us access to the structure of types. For example, we can
trivially tell that "``bar``" and "``foo*``" are the same type by dereferencing
their canonical type pointers and doing a pointer comparison (they both point
to the single "``int*``" type).

Canonical types and typedef types bring up some complexities that must be
carefully managed. Specifically, the ``isa``/``cast``/``dyn_cast`` operators
generally shouldn't be used in code that is inspecting the AST. For example,
when type checking the indirection operator (unary "``*``" on a pointer), the
type checker must verify that the operand has a pointer type. It would not be
correct to check that with "``isa<PointerType>(SubExpr->getType())``", because
this predicate would fail if the subexpression had a typedef type.

The solution to this problem is a set of helper methods on ``Type``, used to
check their properties. In this case, it would be correct to use
"``SubExpr->getType()->isPointerType()``" to do the check. This predicate will
return true if the *canonical type is a pointer*, which is true any time the
type is structurally a pointer type. The only hard part here is remembering
not to use the ``isa``/``cast``/``dyn_cast`` operations.

The second problem we face is how to get access to the pointer type once we
know it exists. To continue the example, the result type of the indirection
operator is the pointee type of the subexpression.
In order to determine the
type, we need to get the instance of ``PointerType`` that best captures the
typedef information in the program. If the type of the expression is literally
a ``PointerType``, we can return that, otherwise we have to dig through the
typedefs to find the pointer type. For example, if the subexpression had type
"``foo*``", we could return that type as the result. If the subexpression had
type "``bar``", we want to return "``foo*``" (note that we do *not* want
"``int*``"). In order to provide all of this, ``Type`` has a
``getAsPointerType()`` method that checks whether the type is structurally a
``PointerType`` and, if so, returns the best one. If not, it returns a null
pointer.

This structure is somewhat mystical, but after meditating on it, it will make
sense to you :).

.. _QualType:

The ``QualType`` class
----------------------

The ``QualType`` class is designed as a trivial value class that is small,
passed by-value and is efficient to query. The idea of ``QualType`` is that it
stores the type qualifiers (``const``, ``volatile``, ``restrict``, plus some
extended qualifiers required by language extensions) separately from the types
themselves. ``QualType`` is conceptually a pair of "``Type*``" and the bits
for these type qualifiers.

By storing the type qualifiers as bits in the conceptual pair, it is extremely
efficient to get the set of qualifiers on a ``QualType`` (just return the field
of the pair), add a type qualifier (which is a trivial constant-time operation
that sets a bit), and remove one or more type qualifiers (just return a
``QualType`` with the bitfield set to empty).

Further, because the bits are stored outside of the type itself, we do not need
to create duplicates of types with different sets of qualifiers (i.e.
there is
only a single heap allocated "``int``" type: "``const int``" and "``volatile
const int``" both point to the same heap allocated "``int``" type). This
reduces the heap size used to represent bits and also means we do not have to
consider qualifiers when uniquing types (:ref:`Type <Type>` does not even
contain qualifiers).

In practice, the two most common type qualifiers (``const`` and ``restrict``)
are stored in the low bits of the pointer to the ``Type`` object, together with
a flag indicating whether extended qualifiers are present (which must be
heap-allocated). This means that ``QualType`` is exactly the same size as a
pointer.

.. _DeclarationName:

Declaration names
-----------------

The ``DeclarationName`` class represents the name of a declaration in Clang.
Declarations in the C family of languages can take several different forms.
Most declarations are named by simple identifiers, e.g., "``f``" and "``x``" in
the function declaration ``f(int x)``. In C++, declaration names can also name
class constructors ("``Class``" in ``struct Class { Class(); }``), class
destructors ("``~Class``"), overloaded operator names ("``operator+``"), and
conversion functions ("``operator void const *``"). In Objective-C,
declaration names can refer to the names of Objective-C methods, which involve
the method name and the parameters, collectively called a *selector*, e.g.,
"``setWidth:height:``". Since all of these kinds of entities --- variables,
functions, Objective-C methods, C++ constructors, destructors, and operators
--- are represented as subclasses of Clang's common ``NamedDecl`` class,
``DeclarationName`` is designed to efficiently represent any kind of name.

Given a ``DeclarationName`` ``N``, ``N.getNameKind()`` will produce a value
that describes what kind of name ``N`` stores.
There are 10 options (all of
the names are inside the ``DeclarationName`` class).

``Identifier``

  The name is a simple identifier. Use ``N.getAsIdentifierInfo()`` to retrieve
  the corresponding ``IdentifierInfo*`` pointing to the actual identifier.

``ObjCZeroArgSelector``, ``ObjCOneArgSelector``, ``ObjCMultiArgSelector``

  The name is an Objective-C selector, which can be retrieved as a ``Selector``
  instance via ``N.getObjCSelector()``. The three possible name kinds for
  Objective-C reflect an optimization within the ``DeclarationName`` class:
  both zero- and one-argument selectors are stored as a masked
  ``IdentifierInfo`` pointer, and therefore require very little space, since
  zero- and one-argument selectors are far more common than multi-argument
  selectors (which use a different structure).

``CXXConstructorName``

  The name is a C++ constructor name. Use ``N.getCXXNameType()`` to retrieve
  the :ref:`type <QualType>` that this constructor is meant to construct. The
  type is always the canonical type, since all constructors for a given type
  have the same name.

``CXXDestructorName``

  The name is a C++ destructor name. Use ``N.getCXXNameType()`` to retrieve
  the :ref:`type <QualType>` whose destructor is being named. This type is
  always a canonical type.

``CXXConversionFunctionName``

  The name is a C++ conversion function. Conversion functions are named
  according to the type they convert to, e.g., "``operator void const *``".
  Use ``N.getCXXNameType()`` to retrieve the type that this conversion function
  converts to. This type is always a canonical type.

``CXXOperatorName``

  The name is a C++ overloaded operator name. Overloaded operators are named
  according to their spelling, e.g., "``operator+``" or "``operator new []``".
  Use ``N.getCXXOverloadedOperator()`` to retrieve the overloaded operator (a
  value of type ``OverloadedOperatorKind``).

``CXXLiteralOperatorName``

  The name is a C++11 user-defined literal operator. User-defined
  literal operators are named according to the suffix they define,
  e.g., "``_foo``" for "``operator "" _foo``". Use
  ``N.getCXXLiteralIdentifier()`` to retrieve the corresponding
  ``IdentifierInfo*`` pointing to the identifier.

``CXXUsingDirective``

  The name is a C++ using directive. Using directives are not really
  ``NamedDecl``\ s, in that they all have the same name, but they are
  implemented as such in order to store them in ``DeclContext``
  effectively.

``DeclarationName``\ s are cheap to create, copy, and compare. They require
only a single pointer's worth of storage in the common cases (identifiers,
zero- and one-argument Objective-C selectors) and use dense, uniqued storage
for the other kinds of names. Two ``DeclarationName``\ s can be compared for
equality (``==``, ``!=``) using a simple bitwise comparison, can be ordered
with ``<``, ``>``, ``<=``, and ``>=`` (which provide a lexicographical ordering
for normal identifiers but an unspecified ordering for other kinds of names),
and can be placed into LLVM ``DenseMap``\ s and ``DenseSet``\ s.

``DeclarationName`` instances can be created in different ways depending on
what kind of name the instance will store. Normal identifiers
(``IdentifierInfo`` pointers) and Objective-C selectors (``Selector``) can be
implicitly converted to ``DeclarationName``\ s. Names for C++ constructors,
destructors, conversion functions, and overloaded operators can be retrieved
from the ``DeclarationNameTable``, an instance of which is available as
``ASTContext::DeclarationNames``.
The member functions
``getCXXConstructorName``, ``getCXXDestructorName``,
``getCXXConversionFunctionName``, and ``getCXXOperatorName``, respectively,
return ``DeclarationName`` instances for the four kinds of C++ special function
names.

.. _DeclContext:

Declaration contexts
--------------------

Every declaration in a program exists within some *declaration context*, such
as a translation unit, namespace, class, or function. Declaration contexts in
Clang are represented by the ``DeclContext`` class, from which the various
declaration-context AST nodes (``TranslationUnitDecl``, ``NamespaceDecl``,
``RecordDecl``, ``FunctionDecl``, etc.) derive. The ``DeclContext`` class
provides several facilities common to each declaration context:

Source-centric vs. Semantics-centric View of Declarations

  ``DeclContext`` provides two views of the declarations stored within a
  declaration context. The source-centric view accurately represents the
  program source code as written, including multiple declarations of entities
  where present (see the section :ref:`Redeclarations and Overloads
  <Redeclarations>`), while the semantics-centric view represents the program
  semantics. The two views are kept synchronized by semantic analysis while
  the ASTs are being constructed.

Storage of declarations within that context

  Every declaration context can contain some number of declarations. For
  example, a C++ class (represented by ``RecordDecl``) contains various member
  functions, fields, nested types, and so on. All of these declarations will
  be stored within the ``DeclContext``, and one can iterate over the
  declarations via [``DeclContext::decls_begin()``,
  ``DeclContext::decls_end()``). This mechanism provides the source-centric
  view of declarations in the context.

Lookup of declarations within that context

  The ``DeclContext`` structure provides efficient name lookup for names within
  that declaration context. For example, if ``N`` is a namespace we can look
  for the name ``N::f`` using ``DeclContext::lookup``. The lookup itself is
  based on a lazily-constructed array (for declaration contexts with a small
  number of declarations) or hash table (for declaration contexts with more
  declarations). The lookup operation provides the semantics-centric view of
  the declarations in the context.

Ownership of declarations

  The ``DeclContext`` owns all of the declarations that were declared within
  its declaration context, and is responsible for the management of their
  memory as well as their (de-)serialization.

All declarations are stored within a declaration context, and one can query
information about the context in which each declaration lives. One can
retrieve the ``DeclContext`` that contains a particular ``Decl`` using
``Decl::getDeclContext``. However, see the section
:ref:`LexicalAndSemanticContexts` for more information about how to interpret
this context information.

.. _Redeclarations:

Redeclarations and Overloads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Within a translation unit, it is common for an entity to be declared several
times. For example, we might declare a function "``f``" and then later
re-declare it as part of an inlined definition:

.. code-block:: c++

  void f(int x, int y, int z = 1);

  inline void f(int x, int y, int z) { /* ... */ }

The representation of "``f``" differs in the source-centric and
semantics-centric views of a declaration context.
In the source-centric view,
all redeclarations will be present, in the order they occurred in the source
code, making this view suitable for clients that wish to see the structure of
the source code. In the semantics-centric view, only the most recent "``f``"
will be found by the lookup, since it effectively replaces the first
declaration of "``f``".

(Note that because ``f`` can be redeclared at block scope, or in a friend
declaration, etc. it is possible that the declaration of ``f`` found by name
lookup will not be the most recent one.)

In the semantics-centric view, overloading of functions is represented
explicitly. For example, given two declarations of a function "``g``" that are
overloaded, e.g.,

.. code-block:: c++

  void g();
  void g(int);

the ``DeclContext::lookup`` operation will return a
``DeclContext::lookup_result`` that contains a range of iterators over
declarations of "``g``". Clients that perform semantic analysis on a program
that is not concerned with the actual source code will primarily use this
semantics-centric view.

.. _LexicalAndSemanticContexts:

Lexical and Semantic Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each declaration has two potentially different declaration contexts: a
*lexical* context, which corresponds to the source-centric view of the
declaration context, and a *semantic* context, which corresponds to the
semantics-centric view. The lexical context is accessible via
``Decl::getLexicalDeclContext`` while the semantic context is accessible via
``Decl::getDeclContext``, both of which return ``DeclContext`` pointers. For
most declarations, the two contexts are identical. For example:

.. code-block:: c++

  class X {
  public:
    void f(int x);
  };

Here, the semantic and lexical contexts of ``X::f`` are the ``DeclContext``
associated with the class ``X`` (itself stored as a ``RecordDecl`` AST node).
However, we can now define ``X::f`` out-of-line:

.. code-block:: c++

  void X::f(int x = 17) { /* ... */ }

This definition of "``f``" has different lexical and semantic contexts. The
lexical context corresponds to the declaration context in which the actual
declaration occurred in the source code, e.g., the translation unit containing
``X``. Thus, this declaration of ``X::f`` can be found by traversing the
declarations provided by [``decls_begin()``, ``decls_end()``) in the
translation unit.

The semantic context of ``X::f`` corresponds to the class ``X``, since this
member function is (semantically) a member of ``X``. Lookup of the name ``f``
into the ``DeclContext`` associated with ``X`` will then return the definition
of ``X::f`` (including information about the default argument).

Transparent Declaration Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In C and C++, there are several contexts in which names that are logically
declared inside another declaration will actually "leak" out into the enclosing
scope from the perspective of name lookup. The most obvious instance of this
behavior is in enumeration types, e.g.,

.. code-block:: c++

  enum Color {
    Red,
    Green,
    Blue
  };

Here, ``Color`` is an enumeration, which is a declaration context that contains
the enumerators ``Red``, ``Green``, and ``Blue``. Thus, traversing the list of
declarations contained in the enumeration ``Color`` will yield ``Red``,
``Green``, and ``Blue``. However, outside of the scope of ``Color`` one can
name the enumerator ``Red`` without qualifying the name, e.g.,

.. code-block:: c++

  Color c = Red;

There are other entities in C++ that provide similar behavior. For example,
linkage specifications that use curly braces:

.. code-block:: c++

  extern "C" {
    void f(int);
    void g(int);
  }
  // f and g are visible here

For source-level accuracy, we treat the linkage specification and enumeration
type as a declaration context in which its enclosed declarations ("``Red``",
"``Green``", and "``Blue``"; "``f``" and "``g``") are declared. However, these
declarations are visible outside of the scope of the declaration context.

These language features (and several others, described below) have roughly the
same set of requirements: declarations are declared within a particular lexical
context, but the declarations are also found via name lookup in scopes
enclosing the declaration itself. This feature is implemented via
*transparent* declaration contexts (see
``DeclContext::isTransparentContext()``), whose declarations are visible in the
nearest enclosing non-transparent declaration context. This means that the
lexical context of the declaration (e.g., an enumerator) will be the
transparent ``DeclContext`` itself, as will the semantic context, but the
declaration will be visible in every outer context up to and including the
first non-transparent declaration context (since transparent declaration
contexts can be nested).

The transparent ``DeclContext``\ s are:

* Enumerations (but not C++11 "scoped enumerations"):

  .. code-block:: c++

    enum Color {
      Red,
      Green,
      Blue
    };
    // Red, Green, and Blue are in scope

* C++ linkage specifications:

  .. code-block:: c++

    extern "C" {
      void f(int);
      void g(int);
    }
    // f and g are in scope

* Anonymous unions and structs:

  .. code-block:: c++

    struct LookupTable {
      bool IsVector;
      union {
        std::vector<Item> *Vector;
        std::set<Item> *Set;
      };
    };

    LookupTable LT;
    LT.Vector = 0; // Okay: finds Vector inside the unnamed union

* C++11 inline namespaces:

  .. code-block:: c++

    namespace mylib {
      inline namespace debug {
        class X;
      }
    }
    mylib::X *xp; // okay: mylib::X refers to mylib::debug::X

.. _MultiDeclContext:

Multiply-Defined Declaration Contexts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

C++ namespaces have the interesting property that
the namespace can be defined multiple times, and the declarations provided by
each namespace definition are effectively merged (from the semantic point of
view). For example, the following two code snippets are semantically
indistinguishable:

.. code-block:: c++

  // Snippet #1:
  namespace N {
    void f();
  }
  namespace N {
    void f(int);
  }

  // Snippet #2:
  namespace N {
    void f();
    void f(int);
  }

In Clang's representation, the source-centric view of declaration contexts will
actually have two separate ``NamespaceDecl`` nodes in Snippet #1, each of which
is a declaration context that contains a single declaration of "``f``".
However, the semantics-centric view provided by name lookup into the namespace
``N`` for "``f``" will return a ``DeclContext::lookup_result`` that contains a
range of iterators over declarations of "``f``".

``DeclContext`` manages multiply-defined declaration contexts internally. The
function ``DeclContext::getPrimaryContext`` retrieves the "primary" context for
a given ``DeclContext`` instance, which is the ``DeclContext`` responsible for
maintaining the lookup table used for the semantics-centric view.
Given a
``DeclContext``, one can obtain the set of declaration contexts that are
semantically connected to this declaration context, in source order, including
this context (which will be the only result, for non-namespace contexts) via
``DeclContext::collectAllContexts``. Note that these functions are used
internally within the lookup and insertion methods of the ``DeclContext``, so
the vast majority of clients can ignore them.

Because the same entity can be defined multiple times in different modules,
it is also possible for there to be multiple definitions of (for instance)
a ``CXXRecordDecl``, all of which describe a definition of the same class.
In such a case, only one of those "definitions" is considered by Clang to be
the definition of the class, and the others are treated as non-defining
declarations that happen to also contain member declarations. Corresponding
members in each definition of such multiply-defined classes are identified
either by redeclaration chains (if the members are ``Redeclarable``)
or by simply a pointer to the canonical declaration (if the declarations
are not ``Redeclarable`` -- in that case, a ``Mergeable`` base class is used
instead).

Error Handling
--------------

Clang produces an AST even when the code contains errors. Clang won't generate
and optimize code for it, but it's used as parsing continues to detect further
errors in the input. Clang-based tools also depend on such ASTs, and IDEs in
particular benefit from a high-quality AST for broken code.

In the presence of errors, Clang uses a few error-recovery strategies to
present the broken code in the AST:

- correcting errors: in cases where Clang is confident about the fix, it
  provides a FixIt attached to the error diagnostic and emits a corrected AST
  (reflecting the written code with FixIts applied).
  The advantage of that is to
  provide more accurate subsequent diagnostics. Typo correction is a typical
  example.
- representing invalid node: the invalid node is preserved in the AST in some
  form, e.g. when the "declaration" part of the declaration contains semantic
  errors, the Decl node is marked as invalid.
- dropping invalid node: this often happens for errors for which we don't have
  graceful recovery. Prior to Recovery AST, a mismatched-argument function call
  expression was dropped though a CallExpr was created for semantic analysis.

With these strategies, Clang surfaces better diagnostics, and provides AST
consumers a rich AST reflecting the written source code as much as possible even
for broken code.

Recovery AST
^^^^^^^^^^^^

The idea of Recovery AST is to use recovery nodes which act as a placeholder to
maintain the rough structure of the parse tree, preserve locations and
children but have no language semantics attached to them.

For example, consider the following mismatched function call:

.. code-block:: c++

   int NoArg();
   void test(int abc) {
     NoArg(abc); // oops, mismatched function arguments.
   }

Without Recovery AST, the invalid function call expression (and its child
expressions) would be dropped in the AST:

::

    |-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
    `-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
      |-ParmVarDecl <col:11, col:15> col:15 used abc 'int'
      `-CompoundStmt <col:20, line:4:1>


With Recovery AST, the AST looks like:

::

    |-FunctionDecl <line:1:1, col:11> NoArg 'int ()'
    `-FunctionDecl <line:2:1, line:4:1> test 'void (int)'
      |-ParmVarDecl <col:11, col:15> used abc 'int'
      `-CompoundStmt <col:20, line:4:1>
        `-RecoveryExpr <line:3:3, col:12> 'int' contains-errors
          |-UnresolvedLookupExpr <col:3> '<overloaded function type>' lvalue (ADL) = 'NoArg'
          `-DeclRefExpr <col:9> 'int' lvalue ParmVar 'abc' 'int'


An alternative is to use existing Exprs, e.g. CallExpr for the above example.
This would capture more call details (e.g. locations of parentheses) and allow
it to be treated uniformly with valid CallExprs. However, jamming the data we
have into CallExpr forces us to weaken its invariants, e.g. arg count may be
wrong. This would introduce a huge burden on consumers of the AST to handle such
"impossible" cases. So when we're representing (rather than correcting) errors,
we use a distinct recovery node type with extremely weak invariants instead.

``RecoveryExpr`` is the only recovery node so far. In practice, broken decls
need more detailed semantics preserved (the current ``Invalid`` flag works
fairly well), and completely broken statements with interesting internal
structure are rare (so dropping the statements is OK).

Types and dependence
^^^^^^^^^^^^^^^^^^^^

``RecoveryExpr`` is an ``Expr``, so it must have a type. In many cases the true
type can't really be known until the code is corrected (e.g. a call to a
function that doesn't exist).
This also means that we can't properly perform type
checks on some containing constructs, such as ``return 42 + unknownFunction()``.

To model this, we generalize the concept of dependence from C++ templates to
mean dependence on a template parameter or how an error is repaired. The
``RecoveryExpr`` ``unknownFunction()`` has the totally unknown type
``DependentTy``, and this suppresses type-based analysis in the same way it
would inside a template.

In cases where we are confident about the concrete type (e.g. the return type
for a broken non-overloaded function call), the ``RecoveryExpr`` will have this
type. This allows more code to be typechecked, and produces a better AST and
more diagnostics. For example:

.. code-block:: C++

  unknownFunction().size() // .size() is a CXXDependentScopeMemberExpr
  std::string(42).size() // .size() is a resolved MemberExpr

Whether or not the ``RecoveryExpr`` has a dependent type, it is always
considered value-dependent, because its value isn't well-defined until the error
is resolved. Among other things, this means that clang doesn't emit more errors
where a ``RecoveryExpr`` is used as a constant (e.g. array size), but also won't
try to evaluate it.

ContainsErrors bit
^^^^^^^^^^^^^^^^^^

Beyond the template dependence bits, we add a new "ContainsErrors" bit to
express the semantic "does this expression or anything within it contain
errors?". This bit is always set for ``RecoveryExpr`` and is propagated to other
related nodes. This provides a fast way to query whether any (recursive) child
of an expression had an error, which is often used to improve diagnostics.

.. code-block:: C++

  // C++
  void recoveryExpr(int abc) {
    unknownFunction(); // type-dependent, value-dependent, contains-errors

    std::string(42).size(); // value-dependent, contains-errors,
                            // not type-dependent, as we know the type is std::string
  }


.. code-block:: C

  // C
  void recoveryExpr(int abc) {
    unknownVar + abc; // type-dependent, value-dependent, contains-errors
  }


The ASTImporter
---------------

The ``ASTImporter`` class imports nodes of an ``ASTContext`` into another
``ASTContext``. Please refer to the document :doc:`ASTImporter: Merging Clang
ASTs <LibASTImporter>` for an introduction. Please also read through the
high-level `description of the import algorithm
<LibASTImporter.html#algorithm-of-the-import>`_; this is essential for
understanding further implementation details of the importer.

.. _templated:

Abstract Syntax Graph
^^^^^^^^^^^^^^^^^^^^^

Despite the name, the Clang AST is not a tree. It is a directed graph with
cycles. One example of a cycle is the connection between a
``ClassTemplateDecl`` and its "templated" ``CXXRecordDecl``. The *templated*
``CXXRecordDecl`` represents all the fields and methods inside the class
template, while the ``ClassTemplateDecl`` holds the information which is
related to being a template, e.g. template arguments. We can get the
*templated* class (the ``CXXRecordDecl``) of a ``ClassTemplateDecl`` with
``ClassTemplateDecl::getTemplatedDecl()``. And we can get back a pointer to the
"described" class template from the *templated* class:
``CXXRecordDecl::getDescribedTemplate()``. So, this is a cycle between two
nodes: between the *templated* and the *described* node. There may be various
other kinds of cycles in the AST, especially in the case of declarations.

.. _structural-eq:

Structural Equivalency
^^^^^^^^^^^^^^^^^^^^^^

Importing one AST node copies that node into the destination ``ASTContext``. To
copy one node means that we create a new node in the "to" context then we set
its properties to be equal to the properties of the source node. Before the
copy, we make sure that the source node is not *structurally equivalent* to any
existing node in the destination context. If it happens to be equivalent then
we skip the copy.

The informal definition of structural equivalency is the following:
Two nodes are **structurally equivalent** if they are

- builtin types and refer to the same type, e.g. ``int`` and ``int`` are
  structurally equivalent,
- function types and all their parameters have structurally equivalent types,
- record types and all their fields in order of their definition have the same
  identifier names and structurally equivalent types,
- variable or function declarations and they have the same identifier name and
  their types are structurally equivalent.

In C, two types are structurally equivalent if they are *compatible types*. For
a formal definition of *compatible types*, please refer to 6.2.7/1 in the C11
standard. However, there is no definition for *compatible types* in the C++
standard. Still, we extend the definition of structural equivalency to
templates and their instantiations similarly: besides checking the previously
mentioned properties, we have to check for equivalent template
parameters/arguments, etc.

The structural equivalence check can be and is used independently from the
ASTImporter, e.g. the ``clang::Sema`` class uses it as well.

The equivalence of nodes may depend on the equivalency of other pairs of nodes.
Thus, the check is implemented as a parallel graph traversal. We traverse
through the nodes of both graphs at the same time.
The actual implementation is
similar to breadth-first-search. Let's say we start the traversal with the <A,B>
pair of nodes. Whenever the traversal reaches a pair <X,Y> then the following
statements are true:

- A and X are nodes from the same ASTContext.
- B and Y are nodes from the same ASTContext.
- A and B may or may not be from the same ASTContext.
- if A == X and B == Y (pointer equivalency) then (there is a cycle during the
  traversal)

  - A and B are structurally equivalent if and only if

    - All dependent nodes on the path from <A,B> to <X,Y> are structurally
      equivalent.

When we compare two classes or enums and one of them is incomplete or has
unloaded external lexical declarations then we cannot descend to compare their
contained declarations. So in these cases they are considered equal if they
have the same names. This is how we compare forward declarations with
definitions.

.. TODO Should we elaborate the actual implementation of the graph traversal,
.. which is a very weird BFS traversal?

Redeclaration Chains
^^^^^^^^^^^^^^^^^^^^

The early version of the ``ASTImporter``'s merge mechanism squashed the
declarations, i.e. it aimed to have only one declaration instead of maintaining
a whole redeclaration chain. This early approach simply skipped importing a
function prototype, but it imported a definition. To demonstrate the problem
with this approach let's consider an empty "to" context and the following
``virtual`` function declarations of ``f`` in the "from" context:

.. code-block:: c++

  struct B { virtual void f(); };
  void B::f() {} // <-- let's import this definition

If we imported the definition with the "squashing" approach then we would
end up having one declaration which is indeed a definition, but ``isVirtual()``
returns ``false`` for it.
The reason is that the definition is indeed not
virtual, it is the property of the prototype!

Consequently, we must either set the virtual flag for the definition (but then
we create a malformed AST which the parser would never create), or we import
the whole redeclaration chain of the function. The most recent version of the
``ASTImporter`` uses the latter mechanism. We do import all function
declarations - regardless of whether they are definitions or prototypes - in
the order in which they appear in the "from" context.

.. One definition

If we have an existing definition in the "to" context, then we cannot import
another definition, we will use the existing definition. However, we can import
prototype(s): we chain the newly imported prototype(s) to the existing
definition. Whenever we import a new prototype from a third context, that will
be added to the end of the redeclaration chain. This may result in long
redeclaration chains in certain cases, e.g. if we import from several
translation units which include the same header with the prototype.

.. Squashing prototypes

To mitigate the problem of long redeclaration chains of free functions, we
could compare prototypes to see if they have the same properties and if yes
then we could merge these prototypes. The implementation of squashing of
prototypes for free functions is future work.

.. Exception: Cannot have more than 1 prototype in-class

Chaining functions this way ensures that we do copy all information from the
source AST. Nonetheless, there is a problem with member functions: While we can
have many prototypes for free functions, we must have only one prototype for a
member function.

.. code-block:: c++

  void f(); // OK
  void f(); // OK

  struct X {
    void f(); // OK
    void f(); // ERROR
  };
  void X::f() {} // OK

Thus, prototypes of member functions must be squashed; we cannot simply
attach a new prototype to the existing in-class prototype. Consider the
following contexts:

.. code-block:: c++

  // "to" context
  struct X {
    void f(); // D0
  };

.. code-block:: c++

  // "from" context
  struct X {
    void f(); // D1
  };
  void X::f() {} // D2

When we import the prototype and the definition of ``f`` from the "from"
context, then the resulting redecl chain will look like this ``D0 -> D2'``,
where ``D2'`` is the copy of ``D2`` in the "to" context.

.. Redecl chains of other declarations

Generally speaking, when we import declarations (like enums and classes) we do
attach the newly imported declaration to the existing redeclaration chain (if
there is structural equivalency). We do not import, however, the whole
redeclaration chain as we do in case of functions. So far, we haven't
found any essential property of forward declarations which is similar to the
case of the virtual flag in a member function prototype. In the future, this
may change, though.

Traversal during the Import
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The node-specific import mechanisms are implemented in
``ASTNodeImporter::VisitNode()`` functions, e.g. ``VisitFunctionDecl()``.
When we import a declaration then first we import everything which is needed to
call the constructor of that declaration node. Everything which can be set
later is set after the node is created. For example, in case of a
``FunctionDecl`` we first import the declaration context in which the function
is declared, then we create the ``FunctionDecl`` and only then we import the
body of the function.
This means there are implicit dependencies between AST
nodes. These dependencies determine the order in which we visit nodes in the
"from" context. As with the regular graph traversal algorithms like DFS, we
keep track of which nodes we have already visited in
``ASTImporter::ImportedDecls``. Whenever we create a node then we immediately
add that to the ``ImportedDecls``. We must not start the import of any other
declarations before we keep track of the newly created one. This is essential;
otherwise we would not be able to handle circular dependencies. To enforce
this, we wrap all constructor calls of all AST nodes in
``GetImportedOrCreateDecl()``. This wrapper ensures that all newly created
declarations are immediately marked as imported; also, if a declaration is
already marked as imported then we just return its counterpart in the "to"
context. Consequently, calling a declaration's ``::Create()`` function directly
would lead to errors; please don't do that!

Even with the use of ``GetImportedOrCreateDecl()`` there is still a
possibility of infinite import recursion if things are imported from
each other in the wrong way. Imagine that during the import of ``A``, the import
of ``B`` is requested before we could create the node for ``A`` (the constructor
needs a reference to ``B``). And the same could be true for the import of ``B``
(``A`` is requested to be imported before we could create the node for ``B``).
In case of the :ref:`templated-described swing <templated>` we take
extra care to break the cyclical dependency: we import and set the
described template only after the ``CXXRecordDecl`` is created. As a best
practice, before creating the node in the "to" context, avoid importing
other nodes which are not needed for the constructor of node ``A``.
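
The register-before-recursing rule can be illustrated outside of Clang with a
toy graph copier. The following is only a minimal sketch, not the
``ASTImporter`` implementation: ``Node``, ``ToyImporter`` and
``getImportedOrCreate()`` are hypothetical names invented for this example, and
the helper merely plays the role that ``GetImportedOrCreateDecl()`` plays in the
importer. The point to notice is that the new node is recorded in the map
before any of its dependencies are imported.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an AST node; this is not a Clang class.
struct Node {
  std::string Name;
  std::vector<Node *> Deps; // Outgoing edges; cycles are allowed.
};

// Hypothetical stand-in for the importer; it owns the nodes of the "to" graph.
class ToyImporter {
  std::vector<std::unique_ptr<Node>> ToNodes;        // The "to" context.
  std::unordered_map<const Node *, Node *> Imported; // "from" -> "to" mapping.

  // Returns true if a new node was created. The new node is registered in
  // Imported immediately, before anything else gets imported.
  bool getImportedOrCreate(const Node *From, Node *&To) {
    auto It = Imported.find(From);
    if (It != Imported.end()) {
      To = It->second; // Already imported, or its import is in progress.
      return false;
    }
    ToNodes.push_back(std::make_unique<Node>());
    To = ToNodes.back().get();
    To->Name = From->Name;
    Imported[From] = To;
    return true;
  }

public:
  Node *import(const Node *From) {
    assert(From && "cannot import a null node");
    Node *To = nullptr;
    if (!getImportedOrCreate(From, To))
      return To;
    // Only now, with the node registered, do we import its dependencies.
    for (const Node *Dep : From->Deps)
      To->Deps.push_back(import(Dep));
    return To;
  }

  std::size_t size() const { return ToNodes.size(); }
};
```

With a two-node cycle ``A -> B -> A`` as input, ``import(&A)`` in this sketch
creates exactly two nodes and the copy of ``B`` points back to the copy of
``A``. If the map update were moved after the loop over ``Deps``, the same
input would recurse without end.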

Error Handling
^^^^^^^^^^^^^^

Every import function returns either an ``llvm::Error`` or an
``llvm::Expected<T>`` object. This forces callers to check the return value of
the import functions. If there was an error during one import then we return
with that error. (Exception: when we import the members of a class, we collect
the individual errors with each member and we concatenate them in one Error
object.) We cache these errors in cases of declarations. During the next import
call if there is an existing error we just return with that. So, clients of the
library receive an Error object, which they must check.

During import of a specific declaration, it may happen that some AST nodes had
already been created before we recognize an error. In this case, we signal back
the error to the caller, but the "to" context remains polluted with those nodes
which had been created. Ideally, those nodes should not have been created, but
at that time we did not know about the error, which happened later. Since the
AST is immutable (in most cases we can't remove existing nodes) we choose
to mark these nodes as erroneous.

We cache the errors associated with declarations in the "from" context in
``ASTImporter::ImportDeclErrors`` and the ones which are associated with the
"to" context in ``ASTImporterSharedState::ImportErrors``. Note that there may
be several ASTImporter objects which import into the same "to" context but from
different "from" contexts; in this case, they have to share the associated
errors of the "to" context.

When an error happens, that propagates through the call stack, through all the
dependent nodes. However, in case of dependency cycles, this is not enough,
because we strive to mark the erroneous nodes so clients can act upon them.
In those
cases, we have to keep track of the errors for those nodes which are
intermediate nodes of a cycle.

An **import path** is the list of the AST nodes which we visit during an Import
call. If node ``A`` depends on node ``B`` then the path contains an ``A->B``
edge. From the call stack of the import functions, we can read the very same
path.

Now imagine the following AST, where the ``->`` represents dependency in terms
of the import (all nodes are declarations).

.. code-block:: text

  A->B->C->D
     `->E

We would like to import A.
The import behaves like a DFS, so we will visit the nodes in this order: ABCDE.
During the visitation we will have the following import paths:

.. code-block:: text

  A
  AB
  ABC
  ABCD
  ABC
  AB
  ABE
  AB
  A

If during the visit of E there is an error then we set an error for E, then as
the call stack shrinks for B, then for A:

.. code-block:: text

  A
  AB
  ABC
  ABCD
  ABC
  AB
  ABE // Error! Set an error to E
  AB  // Set an error to B
  A   // Set an error to A

However, during the import we could import C and D without any error and they
are independent of A, B and E. We must not set up an error for C and D. So, at
the end of the import we have an entry in ``ImportDeclErrors`` for A, B, E but
not for C, D.

Now, what happens if there is a cycle in the import path? Let's consider this
AST:

.. code-block:: text

  A->B->C->A
     `->E

During the visitation, we will have the below import paths and if during the
visit of E there is an error then we will set up an error for E, B, A. But what
about C?

.. code-block:: text

  A
  AB
  ABC
  ABCA
  ABC
  AB
  ABE // Error! Set an error to E
  AB  // Set an error to B
  A   // Set an error to A

This time we know that both B and C are dependent on A. This means we must set
up an error for C too. As the call stack unwinds we get back to A and we must
set up an error for all nodes which depend on A (this includes C). But C is no
longer on the import path, although it was previously. Such a situation can
happen only if during the visitation we had a cycle. If we didn't have any
cycle, then the normal way of passing an Error object through the call stack
could handle the situation. This is why we must track cycles during the import
process for each visited declaration.

Lookup Problems
^^^^^^^^^^^^^^^

When we import a declaration from the source context then we check whether we
already have a structurally equivalent node with the same name in the "to"
context. If the "from" node is a definition and the found one is also a
definition, then we do not create a new node, instead, we mark the found node
as the imported node. If the found definition and the one we want to import
have the same name but they are not structurally equivalent, then we have an
ODR violation in case of C++. If the "from" node is not a definition then we add
that to the redeclaration chain of the found node. This behaviour is essential
when we merge ASTs from different translation units which include the same
header file(s). For example, we want to have only one definition for the class
template ``std::vector``, even if we included ``<vector>`` in several
translation units.

To find a structurally equivalent node we can use the regular C/C++ lookup
functions: ``DeclContext::noload_lookup()`` and
``DeclContext::localUncachedLookup()``. These functions do respect the C/C++
name hiding rules, thus you cannot find certain declarations in a given
declaration context.
For instance, unnamed declarations (anonymous structs),
non-first ``friend`` declarations and template specializations are hidden. This
is a problem, because if we use the regular C/C++ lookup then we create
redundant AST nodes during the merge! Also, having two instances of the same
node could result in false :ref:`structural in-equivalencies <structural-eq>`
of other nodes which depend on the duplicated node. Because of these reasons,
we created a lookup class whose sole purpose is to register all
declarations, so later they can be looked up by subsequent import requests.
This is the ``ASTImporterLookupTable`` class. This lookup table should be
shared amongst the different ``ASTImporter`` instances if they happen to import
to the very same "to" context. This is why we can use the importer specific
lookup only via the ``ASTImporterSharedState`` class.

ExternalASTSource
~~~~~~~~~~~~~~~~~

The ``ExternalASTSource`` is an abstract interface associated with the
``ASTContext`` class. It provides the ability to read the declarations stored
within a declaration context either for iteration or for name lookup. A
declaration context with an external AST source may load its declarations
on-demand. This means that the list of declarations (represented as a linked
list, the head is ``DeclContext::FirstDecl``) could be empty. However, member
functions like ``DeclContext::lookup()`` may initiate a load.

Usually, external sources are associated with precompiled headers. For example,
when we load a class from a PCH then the members are loaded only if we do want
to look up something in the class' context.

In case of LLDB, an implementation of the ``ExternalASTSource`` interface is
attached to the AST context which is related to the parsed expression.
This
implementation of the ``ExternalASTSource`` interface is realized with the help
of the ``ASTImporter`` class. This way, LLDB can reuse Clang's parsing
machinery while synthesizing the underlying AST from the debug data (e.g. from
DWARF). From the view of the ``ASTImporter`` this means both the "to" and the
"from" context may have declaration contexts with external lexical storage. If
a ``DeclContext`` in the "to" AST context has external lexical storage then we
must take extra care to work only with the already loaded declarations!
Otherwise, we would end up with an uncontrolled import process. For instance,
if we used the regular ``DeclContext::lookup()`` to find the existing
declarations in the "to" context then the ``lookup()`` call itself would
initiate a new import while we are in the middle of importing a declaration!
(By the time we initiate the lookup we haven't registered yet that we already
started to import the node of the "from" context.) This is why we use
``DeclContext::noload_lookup()`` instead.

Class Template Instantiations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Different translation units may have class template instantiations with the
same template arguments, but with a different set of instantiated
``MethodDecls`` and ``FieldDecls``. Consider the following files:

.. code-block:: c++

  // x.h
  template <typename T>
  struct X {
      int a{0}; // FieldDecl with InitListExpr
      X(char) : a(3) {}     // (1)
      X(int) {}             // (2)
  };

  // foo.cpp
  void foo() {
      // ClassTemplateSpec with ctor (1): FieldDecl without InitlistExpr
      X<char> xc('c');
  }

  // bar.cpp
  void bar() {
      // ClassTemplateSpec with ctor (2): FieldDecl WITH InitlistExpr
      X<char> xc(1);
  }

In ``foo.cpp`` we use the constructor with number ``(1)``, which explicitly
initializes the member ``a`` to ``3``, thus the ``InitListExpr`` ``{0}`` is not
used here and the AST node is not instantiated. However, in the case of
``bar.cpp`` we use the constructor with number ``(2)``, which does not
explicitly initialize the ``a`` member, so the default ``InitListExpr`` is
needed and thus instantiated. When we merge the AST of ``foo.cpp`` and
``bar.cpp`` we must create an AST node for the class template instantiation of
``X<char>`` which has all the required nodes. Therefore, when we find an
existing ``ClassTemplateSpecializationDecl`` then we merge the fields of the
``ClassTemplateSpecializationDecl`` in the "from" context in a way that the
``InitListExpr`` is copied if not existent yet. The same merge mechanism should
be done in the cases of instantiated default arguments and exception
specifications of functions.

.. _visibility:

Visibility of Declarations
^^^^^^^^^^^^^^^^^^^^^^^^^^

During import of a global variable with external visibility, the lookup will
find variables (with the same name) but with static visibility (linkage).
Clearly, we cannot put them into the same redeclaration chain. The same is true
in the case of functions. Also, we have to take care of other kinds of
declarations like enums, classes, etc. if they are in anonymous namespaces.
Therefore, we filter the lookup results and consider only those which have the
same visibility as the declaration we currently import.

We consider two declarations in two anonymous namespaces to have the same
visibility only if they are imported from the same AST context.

Strategies to Handle Conflicting Names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

During the import we look up existing declarations with the same name. We filter
the lookup results based on their :ref:`visibility <visibility>`. If any of the
found declarations are not structurally equivalent then we have run into a name
conflict error (ODR violation in C++). In this case, we return with an
``Error`` and we set up the ``Error`` object for the declaration. However, some
clients of the ``ASTImporter`` may require a different, perhaps less
conservative and more liberal error handling strategy.

E.g. static analysis clients may benefit if the node is created even if there
is a name conflict. During the CTU analysis of certain projects, we recognized
that there are global declarations which collide with declarations from other
translation units, but they are not referenced outside of their translation
unit. These declarations should be in an unnamed namespace ideally. If we treat
these collisions liberally then CTU analysis can find more results. Note that
the ability to choose between name conflict handling strategies is still
ongoing work.

.. _CFG:

The ``CFG`` class
-----------------

The ``CFG`` class is designed to represent a source-level control-flow graph
for a single statement (``Stmt*``). Typically instances of ``CFG`` are
constructed for function bodies (usually an instance of ``CompoundStmt``), but
can also be instantiated to represent the control-flow of any class that
subclasses ``Stmt``, which includes simple expressions.
Control-flow graphs
are especially useful for performing `flow- or path-sensitive
<https://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities>`_ program
analyses on a given function.

Basic Blocks
^^^^^^^^^^^^

Concretely, an instance of ``CFG`` is a collection of basic blocks. Each basic
block is an instance of ``CFGBlock``, which simply contains an ordered sequence
of ``Stmt*`` (each referring to statements in the AST). The ordering of
statements within a block indicates unconditional flow of control from one
statement to the next. :ref:`Conditional control-flow
<ConditionalControlFlow>` is represented using edges between basic blocks. The
statements within a given ``CFGBlock`` can be traversed using the
``CFGBlock::*iterator`` interface.

A ``CFG`` object owns the instances of ``CFGBlock`` within the control-flow
graph it represents. Each ``CFGBlock`` within a CFG is also uniquely numbered
(accessible via ``CFGBlock::getBlockID()``). Currently the number is based on
the ordering the blocks were created, but no assumptions should be made on how
``CFGBlocks`` are numbered other than their numbers are unique and that they
are numbered from 0..N-1 (where N is the number of basic blocks in the CFG).

Entry and Exit Blocks
^^^^^^^^^^^^^^^^^^^^^

Each instance of ``CFG`` contains two special blocks: an *entry* block
(accessible via ``CFG::getEntry()``), which has no incoming edges, and an
*exit* block (accessible via ``CFG::getExit()``), which has no outgoing edges.
Neither block contains any statements, and they serve the role of providing a
clear entrance and exit for a body of code such as a function body. The
presence of these empty blocks greatly simplifies the implementation of many
analyses built on top of CFGs.

.. _ConditionalControlFlow:

Conditional Control-Flow
^^^^^^^^^^^^^^^^^^^^^^^^

Conditional control-flow (such as those induced by if-statements and loops) is
represented as edges between ``CFGBlocks``. Because different C language
constructs can induce control-flow, each ``CFGBlock`` also records an extra
``Stmt*`` that represents the *terminator* of the block. A terminator is
simply the statement that caused the control-flow, and is used to identify the
nature of the conditional control-flow between blocks. For example, in the
case of an if-statement, the terminator refers to the ``IfStmt`` object in the
AST that represented the given branch.

To illustrate, consider the following code example:

.. code-block:: c++

  int foo(int x) {
    x = x + 1;
    if (x > 2)
      x++;
    else {
      x += 2;
      x *= 2;
    }

    return x;
  }

After invoking the parser+semantic analyzer on this code fragment, the AST of
the body of ``foo`` is referenced by a single ``Stmt*``. We can then construct
an instance of ``CFG`` representing the control-flow graph of this function
body by a single call to a static class method:

.. code-block:: c++

  Stmt *FooBody = ...
  std::unique_ptr<CFG> FooCFG = CFG::buildCFG(FooBody);

Along with providing an interface to iterate over its ``CFGBlocks``, the
``CFG`` class also provides methods that are useful for debugging and
visualizing CFGs. For example, the method ``CFG::dump()`` dumps a
pretty-printed version of the CFG to standard error. This is especially useful
when one is using a debugger such as gdb. For example, here is the output of
``FooCFG->dump()``:

.. code-block:: text

  [ B5 (ENTRY) ]
     Predecessors (0):
     Successors (1): B4

  [ B4 ]
     1: x = x + 1
     2: (x > 2)
     T: if [B4.2]
     Predecessors (1): B5
     Successors (2): B3 B2

  [ B3 ]
     1: x++
     Predecessors (1): B4
     Successors (1): B1

  [ B2 ]
     1: x += 2
     2: x *= 2
     Predecessors (1): B4
     Successors (1): B1

  [ B1 ]
     1: return x;
     Predecessors (2): B2 B3
     Successors (1): B0

  [ B0 (EXIT) ]
     Predecessors (1): B1
     Successors (0):

For each block, the pretty-printed output displays the number of
*predecessor* blocks (blocks that have outgoing control-flow to the given
block) and *successor* blocks (blocks that have incoming control-flow from the
given block). We can also clearly see the special entry
and exit blocks at the beginning and end of the pretty-printed output. For the
entry block (block B5), the number of predecessor blocks is 0, while for the
exit block (block B0) the number of successor blocks is 0.

The most interesting block here is B4, whose outgoing control-flow represents
the branching caused by the sole if-statement in ``foo``. Of particular
interest is the second statement in the block, ``(x > 2)``, and the terminator,
printed as ``if [B4.2]``. The second statement represents the evaluation of
the condition of the if-statement, which occurs before the actual branching of
control-flow. Within the ``CFGBlock`` for B4, the ``Stmt*`` for the second
statement refers to the actual expression in the AST for ``(x > 2)``. Thus
pointers to subclasses of ``Expr`` can appear in the list of statements in a
block, and not just subclasses of ``Stmt`` that refer to proper C statements.

The terminator of block B4 is a pointer to the ``IfStmt`` object in the AST.
The pretty-printer outputs ``if [B4.2]`` because the condition expression of
the if-statement has an actual place in the basic block, and thus the
terminator is essentially *referring* to the expression that is the second
statement of block B4 (i.e., B4.2). In this manner, conditions for
control-flow (which also includes conditions for loops and switch statements)
are hoisted into the actual basic block.

.. Implicit Control-Flow
.. ^^^^^^^^^^^^^^^^^^^^^

.. A key design principle of the ``CFG`` class was to not require any
.. transformations to the AST in order to represent control-flow. Thus the
.. ``CFG`` does not perform any "lowering" of the statements in an AST: loops
.. are not transformed into guarded gotos, short-circuit operations are not
.. converted to a set of if-statements, and so on.

Constant Folding in the Clang AST
---------------------------------

There are several places where constants and constant folding matter a lot to
the Clang front-end. First, in general, we prefer the AST to retain the source
code as close to how the user wrote it as possible. This means that if they
wrote "``5+4``", we want to keep the addition and two constants in the AST; we
don't want to fold to "``9``". This means that constant folding in various
ways turns into a tree walk that needs to handle the various cases.

However, there are places in both C and C++ that require constants to be
folded. For example, the C standard defines what an "integer constant
expression" (i-c-e) is with very precise and specific requirements. The
language then requires i-c-e's in a lot of places (for example, the size of a
bitfield, the value for a case statement, etc). For these, we have to be able
to constant fold the constants, to do semantic checks (e.g., verify bitfield
size is non-negative and that case statements aren't duplicated).
We aim for
Clang to be very pedantic about this, diagnosing cases when the code does not
use an i-c-e where one is required, but accepting the code unless running with
``-pedantic-errors``.

Things get a little bit more tricky when it comes to compatibility with
real-world source code. Specifically, GCC has historically accepted a huge
superset of expressions as i-c-e's, and a lot of real world code depends on
this unfortunate accident of history (including, e.g., the glibc system
headers). GCC accepts anything its "fold" optimizer is capable of reducing to
an integer constant, which means that the definition of what it accepts changes
as its optimizer does. One example is that GCC accepts things like "``case
X-X:``" even when ``X`` is a variable, because it can fold this to 0.

Another issue is how constants interact with the extensions we support, such
as ``__builtin_constant_p``, ``__builtin_inf``, ``__extension__`` and many
others. C99 obviously does not specify the semantics of any of these
extensions, and the definition of i-c-e does not include them. However, these
extensions are often used in real code, and we have to have a way to reason
about them.

Finally, this is not just a problem for semantic analysis. The code generator
and other clients have to be able to fold constants (e.g., to initialize global
variables) and have to handle a superset of what C99 allows. Further, these
clients can benefit from extended information. For example, we know that
"``foo() || 1``" always evaluates to ``true``, but we can't replace the
expression with ``true`` because it has side effects.

Implementation Approach
^^^^^^^^^^^^^^^^^^^^^^^

After trying several different approaches, we've finally converged on a design
(note: at the time of this writing, not all of this has been implemented;
consider this a design goal!).
Our basic approach is to define a single
recursive evaluation method (``Expr::Evaluate``), which is implemented
in ``AST/ExprConstant.cpp``. Given an expression with "scalar" type (integer,
fp, complex, or pointer) this method returns the following information:

* Whether the expression is an integer constant expression, a general constant
  that was folded but has no side effects, a general constant that was folded
  but that does have side effects, or an uncomputable/unfoldable value.
* If the expression was computable in any way, this method returns the
  ``APValue`` for the result of the expression.
* If the expression is not evaluatable at all, this method returns information
  on one of the problems with the expression. This includes a
  ``SourceLocation`` for where the problem is, and a diagnostic ID that explains
  the problem. The diagnostic should have ``ERROR`` type.
* If the expression is not an integer constant expression, this method returns
  information on one of the problems with the expression. This includes a
  ``SourceLocation`` for where the problem is, and a diagnostic ID that
  explains the problem. The diagnostic should have ``EXTENSION`` type.

This information gives various clients the flexibility that they want, and we
will eventually have some helper methods for various extensions. For example,
``Sema`` should have a ``Sema::VerifyIntegerConstantExpression`` method, which
calls ``Evaluate`` on the expression. If the expression is not foldable, the
error is emitted, and it would return ``true``. If the expression is not an
i-c-e, the ``EXTENSION`` diagnostic is emitted. Finally it would return
``false`` to indicate that the AST is OK.

Other clients can use the information in other ways, for example, codegen can
just use expressions that are foldable in any way.
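
To make the four-way classification above concrete, here is a minimal,
self-contained sketch of a recursive evaluator over a toy expression tree. It
is not Clang's implementation: ``ToyExpr``, ``FoldKind``, ``EvalResult`` and
``evaluate`` are hypothetical names invented for illustration, values are plain
``long`` rather than ``APValue``, and no diagnostics are produced.

.. code-block:: c++

  #include <cassert>
  #include <memory>

  // Hypothetical classification mirroring the four outcomes listed above,
  // ordered from most to least restrictive.
  enum class FoldKind {
    IntegerConstantExpr,
    FoldedNoSideEffects,
    FoldedWithSideEffects,
    NotFoldable
  };

  struct EvalResult {
    FoldKind Kind;
    long Value; // Only meaningful when Kind != NotFoldable.
  };

  // A toy expression tree; this is not Clang's Expr hierarchy.
  struct ToyExpr {
    enum OpKind { Literal, Variable, Call, Add, LogicalOr };
    OpKind Op;
    long Value;
    std::shared_ptr<ToyExpr> LHS, RHS;
  };
  using ExprPtr = std::shared_ptr<ToyExpr>;

  ExprPtr node(ToyExpr::OpKind Op, long V, ExprPtr L, ExprPtr R) {
    return std::make_shared<ToyExpr>(ToyExpr{Op, V, L, R});
  }
  ExprPtr lit(long V) { return node(ToyExpr::Literal, V, nullptr, nullptr); }
  ExprPtr var() { return node(ToyExpr::Variable, 0, nullptr, nullptr); }
  ExprPtr call() { return node(ToyExpr::Call, 0, nullptr, nullptr); }
  ExprPtr add(ExprPtr L, ExprPtr R) { return node(ToyExpr::Add, 0, L, R); }
  ExprPtr lor(ExprPtr L, ExprPtr R) { return node(ToyExpr::LogicalOr, 0, L, R); }

  // The less restrictive of two classifications wins.
  FoldKind worse(FoldKind A, FoldKind B) { return A < B ? B : A; }

  // In this toy model only calls have side effects.
  bool hasSideEffects(const ExprPtr &E) {
    if (!E)
      return false;
    return E->Op == ToyExpr::Call || hasSideEffects(E->LHS) ||
           hasSideEffects(E->RHS);
  }

  EvalResult evaluate(const ExprPtr &E) {
    assert(E && "expression must not be null");
    const EvalResult Unfoldable = {FoldKind::NotFoldable, 0};
    switch (E->Op) {
    case ToyExpr::Literal:
      return {FoldKind::IntegerConstantExpr, E->Value};
    case ToyExpr::Variable:
    case ToyExpr::Call:
      return Unfoldable;
    case ToyExpr::Add: {
      EvalResult L = evaluate(E->LHS), R = evaluate(E->RHS);
      if (L.Kind == FoldKind::NotFoldable || R.Kind == FoldKind::NotFoldable)
        return Unfoldable;
      return {worse(L.Kind, R.Kind), L.Value + R.Value};
    }
    case ToyExpr::LogicalOr: {
      EvalResult L = evaluate(E->LHS), R = evaluate(E->RHS);
      if (L.Kind != FoldKind::NotFoldable && L.Value != 0) {
        // The RHS is never evaluated, so the result is 1. It is only an
        // i-c-e if the RHS is one as well.
        FoldKind RK = R.Kind == FoldKind::IntegerConstantExpr
                          ? R.Kind
                          : FoldKind::FoldedNoSideEffects;
        return {worse(L.Kind, RK), 1};
      }
      if (R.Kind == FoldKind::NotFoldable)
        return Unfoldable;
      if (L.Kind != FoldKind::NotFoldable) // LHS folded to zero.
        return {worse(L.Kind, R.Kind), R.Value != 0 ? 1 : 0};
      // The LHS is unknown; the result is known only if the RHS is nonzero.
      if (R.Value == 0)
        return Unfoldable;
      FoldKind K = hasSideEffects(E->LHS) ? FoldKind::FoldedWithSideEffects
                                          : FoldKind::FoldedNoSideEffects;
      return {worse(K, R.Kind), 1};
    }
    }
    return Unfoldable;
  }

With this sketch, ``5+4`` is classified as an integer constant expression with
value 9, while the analogue of ``foo() || 1`` folds to 1 but is flagged as
having side effects, so a client such as codegen knows it cannot simply drop
the call.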

Extensions
^^^^^^^^^^

This section describes how some of the various extensions Clang supports
interact with constant evaluation:

* ``__extension__``: The expression form of this extension causes any
  evaluatable subexpression to be accepted as an integer constant expression.
* ``__builtin_constant_p``: This returns true (as an integer constant
  expression) if the operand evaluates to either a numeric value (that is, not
  a pointer cast to integral type) of integral, enumeration, floating or
  complex type, or if it evaluates to the address of the first character of a
  string literal (possibly cast to some other type). As a special case, if
  ``__builtin_constant_p`` is the (potentially parenthesized) condition of a
  conditional operator expression ("``?:``"), only the true side of the
  conditional operator is considered, and it is evaluated with full constant
  folding.
* ``__builtin_choose_expr``: The condition is required to be an integer
  constant expression, but we accept any constant as an "extension of an
  extension". This only evaluates one operand depending on which way the
  condition evaluates.
* ``__builtin_classify_type``: This always returns an integer constant
  expression.
* ``__builtin_inf, nan, ...``: These are treated just like a floating-point
  literal.
* ``__builtin_abs, copysign, ...``: These are constant folded as general
  constant expressions.
* ``__builtin_strlen`` and ``strlen``: These are constant folded as integer
  constant expressions if the argument is a string literal.

.. _Sema:

The Sema Library
================

This library is called by the :ref:`Parser library <Parser>` during parsing to
do semantic analysis of the input. For valid programs, Sema builds an AST for
parsed constructs.

.. _CodeGen:

The CodeGen Library
===================

CodeGen takes an :ref:`AST <AST>` as input and produces `LLVM IR code
<https://llvm.org/docs/LangRef.html>`_ from it.

How to change Clang
===================

How to add an attribute
-----------------------
Attributes are a form of metadata that can be attached to a program construct,
allowing the programmer to pass semantic information along to the compiler for
various uses. For example, attributes may be used to alter the code generation
for a program construct, or to provide extra semantic information for static
analysis. This document explains how to add a custom attribute to Clang.
Documentation on existing attributes can be found `here
<https://clang.llvm.org/docs/AttributeReference.html>`_.

Attribute Basics
^^^^^^^^^^^^^^^^
Attributes in Clang are handled in three stages: parsing into a parsed attribute
representation, conversion from a parsed attribute into a semantic attribute,
and then the semantic handling of the attribute.

Parsing of the attribute is determined by the various syntactic forms attributes
can take, such as GNU, C++11, and Microsoft style attributes, as well as other
information provided by the table definition of the attribute. Ultimately, the
parsed representation of an attribute object is a ``ParsedAttr`` object.
These parsed attributes chain together as a list of parsed attributes attached
to a declarator or declaration specifier. The parsing of attributes is handled
automatically by Clang, except for attributes spelled as so-called “custom”
keywords. When implementing a custom keyword attribute, the parsing of the
keyword and creation of the ``ParsedAttr`` object must be done manually.
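
As a brief illustration of the syntactic forms mentioned above, the following
sketch applies one existing attribute, ``noinline``, through two of its
spellings. The function names are made up for the example; the point is only
that different surface syntaxes reach the compiler as the same kind of parsed
attribute.

.. code-block:: c++

  // GNU-style spelling: __attribute__((attr)).
  __attribute__((noinline)) int addOne(int X) { return X + 1; }

  // C++11-style spelling with a vendor namespace: [[vendor::attr]].
  [[gnu::noinline]] int twice(int X) { return 2 * X; }

  // A Microsoft-style spelling would be written __declspec(attr). It is
  // omitted here because it is only accepted when the relevant language
  // extensions are enabled.

Both functions behave identically at run time; the attribute only influences
how the compiler treats them.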

Eventually, ``Sema::ProcessDeclAttributeList()`` is called with a ``Decl`` and
a ``ParsedAttr``, at which point the parsed attribute can be transformed
into a semantic attribute. The process by which a parsed attribute is converted
into a semantic attribute depends on the attribute definition and semantic
requirements of the attribute. The end result, however, is that the semantic
attribute object is attached to the ``Decl`` object, and can be obtained by a
call to ``Decl::getAttr<T>()``. Similarly, for statement attributes,
``Sema::ProcessStmtAttributes()`` is called with a ``Stmt`` and a list of
``ParsedAttr`` objects to be converted into semantic attributes.

The structure of the semantic attribute is also governed by the attribute
definition given in ``Attr.td``. This definition is used to automatically
generate functionality used for the implementation of the attribute, such as a
class derived from ``clang::Attr``, information for the parser to use,
automated semantic checking for some attributes, etc.


``include/clang/Basic/Attr.td``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first step to adding a new attribute to Clang is to add its definition to
`include/clang/Basic/Attr.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/Attr.td>`_.
This tablegen definition must derive from the ``Attr`` (tablegen, not
semantic) type, or one of its derivatives. Most attributes will derive from the
``InheritableAttr`` type, which specifies that the attribute can be inherited by
later redeclarations of the ``Decl`` it is associated with.
``InheritableParamAttr`` is similar to ``InheritableAttr``, except that the
attribute is written on a parameter instead of a declaration. If the attribute
applies to statements, it should inherit from ``StmtAttr``.
If the attribute is
intended to apply to a type instead of a declaration, such an attribute should
derive from ``TypeAttr``, and will generally not be given an AST representation.
(Note that this document does not cover the creation of type attributes.) An
attribute that inherits from ``IgnoredAttr`` is parsed, but will generate an
ignored attribute diagnostic when used, which may be useful when an attribute is
supported by another vendor but not supported by clang.

The definition will specify several key pieces of information, such as the
semantic name of the attribute, the spellings the attribute supports, the
arguments the attribute expects, and more. Most members of the ``Attr`` tablegen
type do not require definitions in the derived definition as the defaults
suffice. However, every attribute must specify at least a spelling list, a
subject list, and a documentation list.

Spellings
~~~~~~~~~
All attributes are required to specify a spelling list that denotes the ways in
which the attribute can be spelled. For instance, a single semantic attribute
may have a keyword spelling, as well as a C++11 spelling and a GNU spelling. An
empty spelling list is also permissible and may be useful for attributes which
are created implicitly. The following spellings are accepted:

  ================== =========================================================
  Spelling           Description
  ================== =========================================================
  ``GNU``            Spelled with a GNU-style ``__attribute__((attr))``
                     syntax and placement.
  ``CXX11``          Spelled with a C++-style ``[[attr]]`` syntax with an
                     optional vendor-specific namespace.
  ``C23``            Spelled with a C-style ``[[attr]]`` syntax with an
                     optional vendor-specific namespace.
  ``Declspec``       Spelled with a Microsoft-style ``__declspec(attr)``
                     syntax.
  ``CustomKeyword``  The attribute is spelled as a keyword, and requires
                     custom parsing.
  ``RegularKeyword`` The attribute is spelled as a keyword. It can be
                     used in exactly the places that the standard
                     ``[[attr]]`` syntax can be used, and appertains to
                     exactly the same thing that a standard attribute
                     would appertain to. Lexing and parsing of the keyword
                     are handled automatically.
  ``GCC``            Specifies two or three spellings: the first is a
                     GNU-style spelling, the second is a C++-style spelling
                     with the ``gnu`` namespace, and the third is an optional
                     C-style spelling with the ``gnu`` namespace. Attributes
                     should only specify this spelling for attributes
                     supported by GCC.
  ``Clang``          Specifies two or three spellings: the first is a
                     GNU-style spelling, the second is a C++-style spelling
                     with the ``clang`` namespace, and the third is an
                     optional C-style spelling with the ``clang`` namespace.
                     By default, a C-style spelling is provided.
  ``Pragma``         The attribute is spelled as a ``#pragma``, and requires
                     custom processing within the preprocessor. If the
                     attribute is meant to be used by Clang, it should
                     set the namespace to ``"clang"``. Note that this
                     spelling is not used for declaration attributes.
  ================== =========================================================

The C++ standard specifies that “any [non-standard attribute] that is not
recognized by the implementation is ignored” (``[dcl.attr.grammar]``).
The rule for C is similar. This makes ``CXX11`` and ``C23`` spellings
unsuitable for attributes that affect the type system, that change the
binary interface of the code, or that have other similar semantic meaning.

``RegularKeyword`` provides an alternative way of spelling such attributes.
It reuses the production rules for standard attributes, but it applies them
to plain keywords rather than to ``[[…]]`` sequences. Compilers that don't
recognize the keyword are likely to report an error of some kind.

For example, the ``ArmStreaming`` function type attribute affects
both the type system and the binary interface of the function.
It cannot therefore be spelled ``[[arm::streaming]]``, since compilers
that don't understand ``arm::streaming`` would ignore it and miscompile
the code. ``ArmStreaming`` is instead spelled ``__arm_streaming``, but it
can appear wherever a hypothetical ``[[arm::streaming]]`` could appear.

Subjects
~~~~~~~~
Attributes appertain to one or more subjects. If the attribute attempts to
attach to a subject that is not in the subject list, a diagnostic is issued
automatically. Whether the diagnostic is a warning or an error depends on how
the attribute's ``SubjectList`` is defined, but the default behavior is to warn.
The diagnostics displayed to the user are automatically determined based on the
subjects in the list, but a custom diagnostic parameter can also be specified in
the ``SubjectList``. The diagnostics generated for subject list violations are
calculated automatically or specified by the subject list itself. If a
previously unused Decl node is added to the ``SubjectList``, the logic used to
automatically determine the diagnostic parameter in `utils/TableGen/ClangAttrEmitter.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/ClangAttrEmitter.cpp>`_
may need to be updated.

By default, all subjects in the SubjectList must either be a Decl node defined
in ``DeclNodes.td``, or a statement node defined in ``StmtNodes.td``. However,
more complex subjects can be created by creating a ``SubsetSubject`` object.
Each such object has a base subject which it appertains to (which must be a
Decl or Stmt node, and not a SubsetSubject node), and some custom code which is
called when determining whether an attribute appertains to the subject. For
instance, a ``NonBitField`` SubsetSubject appertains to a ``FieldDecl``, and
tests that the given ``FieldDecl`` is not a bit-field. When a SubsetSubject is
specified in a SubjectList, a custom diagnostic parameter must also be provided.

Diagnostic checking for attribute subject lists for declaration and statement
attributes is automated except when ``HasCustomParsing`` is set to ``1``.

Documentation
~~~~~~~~~~~~~
All attributes must have some form of documentation associated with them.
Documentation is table generated on the public web server by a server-side
process that runs daily. Generally, the documentation for an attribute is a
stand-alone definition in `include/clang/Basic/AttrDocs.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/AttrDocs.td>`_
that is named after the attribute being documented.

If the attribute is not for public consumption, or is an implicitly-created
attribute that has no visible spelling, the documentation list can specify the
``InternalOnly`` object. Otherwise, the attribute should have its documentation
added to ``AttrDocs.td``.

Documentation derives from the ``Documentation`` tablegen type. All derived
types must specify a documentation category and the actual documentation itself.
Additionally, it can specify a custom heading for the attribute, though a
default heading will be chosen when possible.

There are four predefined documentation categories: ``DocCatFunction`` for
attributes that appertain to function-like subjects, ``DocCatVariable`` for
attributes that appertain to variable-like subjects, ``DocCatType`` for type
attributes, and ``DocCatStmt`` for statement attributes. A custom documentation
category should be used for groups of attributes with similar functionality.
Custom categories are good for providing overview information for the attributes
grouped under them. For instance, the consumed annotation attributes define a
custom category, ``DocCatConsumed``, that explains what consumed annotations are
at a high level.

Documentation content (whether it is for an attribute or a category) is written
using reStructuredText (RST) syntax.

After writing the documentation for the attribute, it should be locally tested
to ensure that there are no issues generating the documentation on the server.
Local testing requires a fresh build of clang-tblgen. To generate the attribute
documentation, execute the following command::

  clang-tblgen -gen-attr-docs -I /path/to/clang/include /path/to/clang/include/clang/Basic/Attr.td -o /path/to/clang/docs/AttributeReference.rst

When testing locally, *do not* commit changes to ``AttributeReference.rst``.
This file is generated by the server automatically, and any changes made to this
file will be overwritten.

Arguments
~~~~~~~~~
Attributes may optionally specify a list of arguments that can be passed to the
attribute. Attribute arguments specify both the parsed form and the semantic
form of the attribute.
For example, if ``Args`` is
``[StringArgument<"Arg1">, IntArgument<"Arg2">]`` then
``__attribute__((myattribute("Hello", 3)))`` will be a valid use; it requires
two arguments while parsing, and the Attr subclass' constructor for the
semantic attribute will require a string and integer argument.

All arguments have a name and a flag that specifies whether the argument is
optional. The associated C++ type of the argument is determined by the argument
definition type. If the existing argument types are insufficient, new types can
be created, but it requires modifying `utils/TableGen/ClangAttrEmitter.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/utils/TableGen/ClangAttrEmitter.cpp>`_
to properly support the type.

Other Properties
~~~~~~~~~~~~~~~~
The ``Attr`` definition has other members which control the behavior of the
attribute. Many of them are special-purpose and beyond the scope of this
document; however, a few deserve mention.

If the parsed form of the attribute is more complex, or differs from the
semantic form, the ``HasCustomParsing`` bit can be set to ``1`` for the class,
and the parsing code in `Parser::ParseGNUAttributeArgs()
<https://github.com/llvm/llvm-project/blob/main/clang/lib/Parse/ParseDecl.cpp>`_
can be updated for the special case. Note that this only applies to attributes
with a GNU spelling -- attributes with a ``__declspec`` spelling currently
ignore this flag and are handled by ``Parser::ParseMicrosoftDeclSpec``.

Note that setting this member to ``1`` will opt out of common attribute semantic
handling, requiring extra implementation efforts to ensure the attribute
appertains to the appropriate subject, etc.

If the attribute should not be propagated from a template declaration to an
instantiation of the template, set the ``Clone`` member to 0.
By default, all
attributes will be cloned to template instantiations.

Attributes that do not require an AST node should set the ``ASTNode`` field to
``0`` to avoid polluting the AST. Note that anything inheriting from
``TypeAttr`` or ``IgnoredAttr`` automatically does not generate an AST node. All
other attributes generate an AST node by default. The AST node is the semantic
representation of the attribute.

The ``LangOpts`` field specifies a list of language options required by the
attribute. For instance, all of the CUDA-specific attributes specify ``[CUDA]``
for the ``LangOpts`` field, and when the CUDA language option is not enabled, an
"attribute ignored" warning diagnostic is emitted. Since language options are
not table generated nodes, new language options must be created manually and
should specify the spelling used by the ``LangOptions`` class.

Custom accessors can be generated for an attribute based on the spelling list
for that attribute. For instance, if an attribute has two different spellings,
'Foo' and 'Bar', accessors can be created:
``[Accessor<"isFoo", [GNU<"Foo">]>, Accessor<"isBar", [GNU<"Bar">]>]``
These accessors will be generated on the semantic form of the attribute,
accepting no arguments and returning a ``bool``.

Attributes that do not require custom semantic handling should set the
``SemaHandler`` field to ``0``. Note that anything inheriting from
``IgnoredAttr`` automatically does not get a semantic handler. All other
attributes are assumed to use a semantic handler by default. Attributes
without a semantic handler are not given a parsed attribute ``Kind`` enumerator.

"Simple" attributes, that require no custom semantic processing aside from what
is automatically provided, should set the ``SimpleHandler`` field to ``1``.
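
To give a feel for the custom accessors described earlier in this section, the
following hand-written class models what accessors for the 'Foo' and 'Bar'
spellings might look like on a semantic attribute. This is only a sketch under
the assumption that an accessor answers "was the attribute written with this
spelling?"; ``ToyFooBarAttr`` and its ``Spelling`` enumeration are hypothetical
and are not the code that tablegen emits.

.. code-block:: c++

  // Hypothetical stand-in for a generated semantic attribute class.
  class ToyFooBarAttr {
  public:
    // One enumerator per entry in the (hypothetical) spelling list.
    enum Spelling { GNU_Foo = 0, GNU_Bar = 1 };

    explicit ToyFooBarAttr(Spelling S) : SpellingIndex(S) {}

    // Accessors take no arguments and return bool, as described above.
    bool isFoo() const { return SpellingIndex == GNU_Foo; }
    bool isBar() const { return SpellingIndex == GNU_Bar; }

  private:
    unsigned SpellingIndex; // Which spelling the user actually wrote.
  };

A client holding such an attribute could then branch on ``isFoo()`` or
``isBar()`` instead of inspecting the spelling directly.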

Target-specific attributes may share a spelling with other attributes in
different targets. For instance, the ARM and MSP430 targets both have an
attribute spelled ``GNU<"interrupt">``, but with different parsing and semantic
requirements. To support this feature, an attribute inheriting from
``TargetSpecificAttribute`` may specify a ``ParseKind`` field. This field
should have the same value for all attributes sharing a spelling, and
corresponds to the parsed attribute's ``Kind`` enumerator. This allows
attributes to share a parsed attribute kind, but have distinct semantic
attribute classes. For instance, ``ParsedAttr::AT_Interrupt`` is the shared
parsed attribute kind, but ``ARMInterruptAttr`` and ``MSP430InterruptAttr`` are
the semantic attributes generated.

By default, attribute arguments are parsed in an evaluated context. If the
arguments for an attribute should be parsed in an unevaluated context (akin to
the way the argument to a ``sizeof`` expression is parsed), set
``ParseArgumentsAsUnevaluated`` to ``1``.

If additional functionality is desired for the semantic form of the attribute,
the ``AdditionalMembers`` field specifies code to be copied verbatim into the
semantic attribute class object, with ``public`` access.

If two or more attributes cannot be used in combination on the same declaration
or statement, a ``MutualExclusions`` definition can be supplied to automatically
generate diagnostic code. This will disallow the attribute combinations
regardless of spellings used. Additionally, it will diagnose combinations within
the same attribute list, different attribute lists, and redeclarations, as
appropriate.

Boilerplate
^^^^^^^^^^^
All semantic processing of declaration attributes happens in `lib/Sema/SemaDeclAttr.cpp
<https://github.com/llvm/llvm-project/blob/main/clang/lib/Sema/SemaDeclAttr.cpp>`_,
and generally starts in the ``ProcessDeclAttribute()`` function. If the
attribute has the ``SimpleHandler`` field set to ``1`` then the function to
process the attribute will be automatically generated, and nothing needs to be
done here. Otherwise, write a new ``handleYourAttr()`` function, and add that to
the switch statement. Please do not implement handling logic directly in the
``case`` for the attribute.

Unless otherwise specified by the attribute definition, common semantic checking
of the parsed attribute is handled automatically. This includes diagnosing
parsed attributes that do not appertain to the given ``Decl`` or ``Stmt``,
ensuring the correct minimum number of arguments are passed, etc.

If the attribute adds additional warnings, define a ``DiagGroup`` in
`include/clang/Basic/DiagnosticGroups.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/DiagnosticGroups.td>`_
named after the attribute's ``Spelling`` with "_"s replaced by "-"s. If there
is only a single diagnostic, it is permissible to use ``InGroup<DiagGroup<"your-attribute">>``
directly in `DiagnosticSemaKinds.td
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/DiagnosticSemaKinds.td>`_.

All semantic diagnostics generated for your attribute, including automatically-
generated ones (such as subjects and argument counts), should have a
corresponding test case.

Semantic handling
^^^^^^^^^^^^^^^^^
Most attributes are implemented to have some effect on the compiler. For
instance, to modify the way code is generated, or to add extra semantic checks
for an analysis pass, etc.
Having added the attribute definition and conversion
to the semantic representation for the attribute, what remains is to implement
the custom logic requiring use of the attribute.

The ``clang::Decl`` object can be queried for the presence or absence of an
attribute using ``hasAttr<T>()``. To obtain a pointer to the semantic
representation of the attribute, ``getAttr<T>()`` may be used.

The ``clang::AttributedStmt`` object can be queried for the presence or absence
of an attribute by calling ``getAttrs()`` and looping over the list of
attributes.

How to add an expression or statement
-------------------------------------

Expressions and statements are among the most fundamental constructs within a
compiler, because they interact with many different parts of the AST, semantic
analysis, and IR generation. Therefore, adding a new expression or statement
kind into Clang requires some care. The following list details the various
places in Clang where an expression or statement needs to be introduced, along
with patterns to follow to ensure that the new expression or statement works
well across all of the C languages. We focus on expressions, but statements
are similar.

#. Introduce parsing actions into the parser. Recursive-descent parsing is
   mostly self-explanatory, but there are a few things that are worth keeping
   in mind:

   * Keep as much source location information as possible! You'll want it later
     to produce great diagnostics and support Clang's various features that map
     between source code and the AST.
   * Write tests for all of the "bad" parsing cases, to make sure your recovery
     is good. If you have matched delimiters (e.g., parentheses, square
     brackets, etc.), use ``Parser::BalancedDelimiterTracker`` to give nice
     diagnostics when things go wrong.

#. Introduce semantic analysis actions into ``Sema``.
   Semantic analysis should
   always involve two functions: an ``ActOnXXX`` function that will be called
   directly from the parser, and a ``BuildXXX`` function that performs the
   actual semantic analysis and will (eventually!) build the AST node. It's
   fairly common for the ``ActOnXXX`` function to do very little (often just
   some minor translation from the parser's representation to ``Sema``'s
   representation of the same thing), but the separation is still important:
   C++ template instantiation, for example, should always call the ``BuildXXX``
   variant. Several notes on semantic analysis before we get into construction
   of the AST:

   * Your expression probably involves some types and some subexpressions.
     Make sure to fully check that those types, and the types of those
     subexpressions, meet your expectations. Add implicit conversions where
     necessary to make sure that all of the types line up exactly the way you
     want them. Write extensive tests to check that you're getting good
     diagnostics for mistakes and that you can use various forms of
     subexpressions with your expression.
   * When type-checking a type or subexpression, make sure to first check
     whether the type is "dependent" (``Type::isDependentType()``) or whether a
     subexpression is type-dependent (``Expr::isTypeDependent()``). If either
     of these returns ``true``, then you're inside a template and you can't do
     much type-checking now. That's normal, and your AST node (when you get
     there) will have to deal with this case. At this point, you can write
     tests that use your expression within templates, but don't try to
     instantiate the templates.
   * For each subexpression, be sure to call ``Sema::CheckPlaceholderExpr()``
     to deal with "weird" expressions that don't behave well as subexpressions.
     Then, determine whether you need to perform lvalue-to-rvalue conversions
     (``Sema::DefaultLvalueConversion``) or the usual unary conversions
     (``Sema::UsualUnaryConversions``), for places where the subexpression is
     producing a value you intend to use.
   * Your ``BuildXXX`` function will probably just return ``ExprError()`` at
     this point, since you don't have an AST. That's perfectly fine, and
     shouldn't impact your testing.

#. Introduce an AST node for your new expression. This starts with declaring
   the node in ``include/Basic/StmtNodes.td`` and creating a new class for your
   expression in the appropriate ``include/AST/Expr*.h`` header. It's best to
   look at the class for a similar expression to get ideas, and there are some
   specific things to watch for:

   * If you need to allocate memory, use the ``ASTContext`` allocator to
     allocate memory. Never use raw ``malloc`` or ``new``, and never hold any
     resources in an AST node, because the destructor of an AST node is never
     called.
   * Make sure that ``getSourceRange()`` covers the exact source range of your
     expression. This is needed for diagnostics and for IDE support.
   * Make sure that ``children()`` visits all of the subexpressions. This is
     important for a number of features (e.g., IDE support, C++ variadic
     templates). If you have sub-types, you'll also need to visit those
     sub-types in ``RecursiveASTVisitor``.
   * Add printing support (``StmtPrinter.cpp``) for your expression.
   * Add profiling support (``StmtProfile.cpp``) for your AST node, noting the
     distinguishing (non-source location) characteristics of an instance of
     your expression. Omitting this step will lead to hard-to-diagnose
     failures regarding matching of template declarations.
   * Add serialization support (``ASTReaderStmt.cpp``, ``ASTWriterStmt.cpp``)
     for your AST node.

#. Teach semantic analysis to build your AST node. At this point, you can wire
   up your ``Sema::BuildXXX`` function to actually create your AST. A few
   things to check at this point:

   * If your expression can construct a new C++ class or return a new
     Objective-C object, be sure to update and then call
     ``Sema::MaybeBindToTemporary`` for your just-created AST node to be sure
     that the object gets properly destructed. An easy way to test this is to
     return a C++ class with a private destructor: semantic analysis should
     flag an error here with the attempt to call the destructor.
   * Inspect the generated AST by printing it using ``clang -cc1 -ast-print``,
     to make sure you're capturing all of the important information about how
     the AST was written.
   * Inspect the generated AST under ``clang -cc1 -ast-dump`` to verify that
     all of the types in the generated AST line up the way you want them.
     Remember that clients of the AST should never have to "think" to
     understand what's going on. For example, all implicit conversions should
     show up explicitly in the AST.
   * Write tests that use your expression as a subexpression of other,
     well-known expressions. Can you call a function using your expression as
     an argument? Can you use the ternary operator?

#. Teach code generation to create IR for your AST node. This step is the first
   (and only) one that requires knowledge of LLVM IR. There are several things
   to keep in mind:

   * Code generation is separated into scalar/aggregate/complex and
     lvalue/rvalue paths, depending on what kind of result your expression
     produces. On occasion, this requires some careful factoring of code to
     avoid duplication.
   * ``CodeGenFunction`` contains functions ``ConvertType`` and
     ``ConvertTypeForMem`` that convert Clang's types (``clang::Type*`` or
     ``clang::QualType``) to LLVM types.
Use the former for values, and the 3322 latter for memory locations: test with the C++ "``bool``" type to check 3323 this. If you find that you are having to use LLVM bitcasts to make the 3324 subexpressions of your expression have the type that your expression 3325 expects, STOP! Go fix semantic analysis and the AST so that you don't 3326 need these bitcasts. 3327 * The ``CodeGenFunction`` class has a number of helper functions to make 3328 certain operations easy, such as generating code to produce an lvalue or 3329 an rvalue, or to initialize a memory location with a given value. Prefer 3330 to use these functions rather than directly writing loads and stores, 3331 because these functions take care of some of the tricky details for you 3332 (e.g., for exceptions). 3333 * If your expression requires some special behavior in the event of an 3334 exception, look at the ``push*Cleanup`` functions in ``CodeGenFunction`` 3335 to introduce a cleanup. You shouldn't have to deal with 3336 exception-handling directly. 3337 * Testing is extremely important in IR generation. Use ``clang -cc1 3338 -emit-llvm`` and `FileCheck 3339 <https://llvm.org/docs/CommandGuide/FileCheck.html>`_ to verify that you're 3340 generating the right IR. 3341 3342#. Teach template instantiation how to cope with your AST node, which requires 3343 some fairly simple code: 3344 3345 * Make sure that your expression's constructor properly computes the flags 3346 for type dependence (i.e., the type your expression produces can change 3347 from one instantiation to the next), value dependence (i.e., the constant 3348 value your expression produces can change from one instantiation to the 3349 next), instantiation dependence (i.e., a template parameter occurs 3350 anywhere in your expression), and whether your expression contains a 3351 parameter pack (for variadic templates). Often, computing these flags 3352 just means combining the results from the various types and 3353 subexpressions. 
3354 * Add ``TransformXXX`` and ``RebuildXXX`` functions to the ``TreeTransform`` 3355 class template in ``Sema``. ``TransformXXX`` should (recursively) 3356 transform all of the subexpressions and types within your expression, 3357 using ``getDerived().TransformYYY``. If all of the subexpressions and 3358 types transform without error, it will then call the ``RebuildXXX`` 3359 function, which will in turn call ``getSema().BuildXXX`` to perform 3360 semantic analysis and build your expression. 3361 * To test template instantiation, take those tests you wrote to make sure 3362 that you were type checking with type-dependent expressions and dependent 3363 types (from step #2) and instantiate those templates with various types, 3364 some of which type-check and some that don't, and test the error messages 3365 in each case. 3366 3367#. There are some "extras" that make other features work better. It's worth 3368 handling these extras to give your expression complete integration into 3369 Clang: 3370 3371 * Add code completion support for your expression in 3372 ``SemaCodeComplete.cpp``. 3373 * If your expression has types in it, or has any "interesting" features 3374 other than subexpressions, extend libclang's ``CursorVisitor`` to provide 3375 proper visitation for your expression, enabling various IDE features such 3376 as syntax highlighting, cross-referencing, and so on. The 3377 ``c-index-test`` helper program can be used to test these features. 3378 3379Testing 3380------- 3381All functional changes to Clang should come with test coverage demonstrating 3382the change in behavior. 3383 3384.. _verifying-diagnostics: 3385 3386Verifying Diagnostics 3387^^^^^^^^^^^^^^^^^^^^^ 3388Clang ``-cc1`` supports the ``-verify`` command line option as a way to 3389validate diagnostic behavior. This option will use special comments within the 3390test file to verify that expected diagnostics appear in the correct source 3391locations. 
If all of the expected diagnostics match the actual output of Clang,
then the invocation will return normally. If there are discrepancies between
the expected and actual output, Clang will emit detailed information about
which expected diagnostics were not seen or which unexpected diagnostics were
seen, etc. A complete example is:

.. code-block:: c++

  // RUN: %clang_cc1 -verify %s
  int A = B; // expected-error {{use of undeclared identifier 'B'}}

If the test is run and the expected error is emitted on the expected line, the
diagnostic verifier will pass. However, if the expected error does not appear
or appears in a different location than expected, or if additional diagnostics
appear, the diagnostic verifier will fail and emit information as to why.

The ``-verify`` option optionally accepts a comma-delimited list of one or
more verification prefixes that can be used to craft those special comments.
Each prefix must start with a letter and contain only alphanumeric characters,
hyphens, and underscores. ``-verify`` by itself is equivalent to
``-verify=expected``, meaning that special comments will start with
``expected``. Using different prefixes makes it easier to have separate
``RUN:`` lines in the same test file which result in differing diagnostic
behavior. For example:

.. code-block:: c++

  // RUN: %clang_cc1 -verify=foo,bar %s

  int A = B; // foo-error {{use of undeclared identifier 'B'}}
  int C = D; // bar-error {{use of undeclared identifier 'D'}}
  int E = F; // expected-error {{use of undeclared identifier 'F'}}

The verifier will recognize ``foo-error`` and ``bar-error`` as special comments
but will not recognize ``expected-error`` as one because the ``-verify`` line
does not contain that as a prefix.
Thus, this test would fail verification
because an unexpected diagnostic would appear on the declaration of ``E``.

Multiple occurrences accumulate prefixes. For example,
``-verify -verify=foo,bar -verify=baz`` is equivalent to
``-verify=expected,foo,bar,baz``.

Specifying Diagnostics
^^^^^^^^^^^^^^^^^^^^^^
Indicating that a line expects an error or a warning is easy. Put a comment
on the line that has the diagnostic, use
``expected-{error,warning,remark,note}`` to tag if it's an expected error,
warning, remark, or note (respectively), and place the expected text between
``{{`` and ``}}`` markers. The full text doesn't have to be included, only
enough to ensure that the correct diagnostic was emitted. (Note: full text
should be included in test cases unless there is a compelling reason to use
truncated text instead.)

For a full description of the matching behavior, including more complex
matching scenarios, see :ref:`matching <DiagnosticMatching>` below.

Here's an example of the most commonly used way to specify expected
diagnostics:

.. code-block:: c++

  int A = B; // expected-error {{use of undeclared identifier 'B'}}

You can place as many diagnostics on one line as you wish. To make the code
more readable, you can use backslash-newline to separate out the diagnostics.

Alternatively, it is possible to specify the line on which the diagnostic
should appear by appending ``@<line>`` to ``expected-<type>``, for example:

.. code-block:: c++

  #warning some text
  // expected-warning@10 {{some text}}

The line number may be absolute (as above), or relative to the current line by
prefixing the number with either ``+`` or ``-``.
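
As a sketch of the relative forms, the following hypothetical test places one
directive before and one after the line it refers to. It assumes the test is
run with ``-Wunused-variable`` enabled; the ``RUN:`` line and the function and
variable names are illustrative and not taken from an existing test:

.. code-block:: c++

  // RUN: %clang_cc1 -fsyntax-only -Wunused-variable -verify %s

  // The directive below refers to the declaration three lines after it.
  // expected-warning@+3 {{unused variable 'scratch'}}
  int answer() {
    // (intervening line, counted by the offset)
    int scratch;
    return 42;
  }

  int zero() {
    int spare;
    // The directive below refers to the declaration one line before it.
    // expected-warning@-1 {{unused variable 'spare'}}
    return 0;
  }

Relative offsets keep a directive valid when unrelated lines are added earlier
in the file, which an absolute line number does not.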

If the diagnostic is generated in a separate file, for example in a shared
header file, it may be beneficial to be able to declare the file in which the
diagnostic will appear, rather than placing the ``expected-*`` directive in the
actual file itself. This can be done using the following syntax:

.. code-block:: c++

  // expected-error@path/include.h:15 {{error message}}

The path can be absolute or relative and the same search paths will be used as
for ``#include`` directives. The line number in an external file may be
substituted with ``*`` meaning that any line number will match (useful where
the included file is, for example, a system header where the actual line number
may change and is not critical).

As an alternative to specifying a fixed line number, the location of a
diagnostic can instead be indicated by a marker of the form ``#<marker>``.
Markers are specified by including them in a comment, and then referenced by
appending the marker to the diagnostic with ``@#<marker>``, as with:

.. code-block:: c++

  #warning some text  // #1
  // ... other code ...
  // expected-warning@#1 {{some text}}

The name of a marker used in a directive must be unique within the compilation.

The simple syntax above allows each specification to match exactly one
diagnostic. You can use the extended syntax to customize this. The extended
syntax is ``expected-<type> <n> {{diag text}}``, where ``<type>`` is one of
``error``, ``warning``, ``remark``, or ``note``, and ``<n>`` is a positive
integer. This allows the diagnostic to appear as many times as specified. For
example:

.. code-block:: c++

  void f(); // expected-note 2 {{previous declaration is here}}

Where the diagnostic is expected to occur a minimum number of times, this can
be specified by appending a ``+`` to the number. For example:

.. code-block:: c++

  void f(); // expected-note 0+ {{previous declaration is here}}
  void g(); // expected-note 1+ {{previous declaration is here}}

In the first example, the diagnostic becomes optional, i.e. it will be
swallowed if it occurs, but will not generate an error if it does not occur. In
the second example, the diagnostic must occur at least once. As a short-hand,
"one or more" can be specified simply by ``+``. For example:

.. code-block:: c++

  void g(); // expected-note + {{previous declaration is here}}

A range can also be specified by ``<n>-<m>``. For example:

.. code-block:: c++

  void f(); // expected-note 0-1 {{previous declaration is here}}

In this example, the diagnostic may appear only once, if at all.

.. _DiagnosticMatching:

Matching Modes
~~~~~~~~~~~~~~

The default matching mode is simple string, which looks for the expected text
that appears between the first ``{{`` and ``}}`` pair of the comment. The
string is interpreted just as-is, with one exception: the sequence ``\n`` is
converted to a single newline character. This mode matches the emitted
diagnostic when the text appears as a substring at any position of the emitted
message.

To enable matching against desired strings that contain ``}}`` or ``{{``, the
string-mode parser accepts opening delimiters of more than two curly braces,
like ``{{{``. It then looks for a closing delimiter of equal "width" (i.e.,
``}}}``). For example:

.. code-block:: c++

  // expected-note {{{evaluates to '{{2, 3, 4}} == {0, 3, 4}'}}}

The intent is to allow the delimiter to be wider than the longest ``{`` or
``}`` brace sequence in the content, so that if your expected text contains
``{{{`` (three braces) it may be delimited with ``{{{{`` (four braces), and so
on.
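
The width rule can be sketched as a small helper that picks the narrowest
usable delimiter for a given expected text. This only illustrates the rule
described above; ``minDelimiterWidth`` is a hypothetical name and is not part
of Clang's verifier:

.. code-block:: c++

  #include <algorithm>
  #include <cstddef>
  #include <string>

  // Hypothetical helper: returns the number of braces to use on each side of
  // the expected text, i.e. one more than the longest run of '{' or of '}' in
  // the text, and never fewer than the default of two.
  inline std::size_t minDelimiterWidth(const std::string &Text) {
    std::size_t Longest = 0;
    std::size_t Run = 0;
    char Prev = '\0';
    for (char C : Text) {
      if (C == '{' || C == '}') {
        // Extend the run only while the same brace character repeats.
        Run = (C == Prev) ? Run + 1 : 1;
        Longest = std::max(Longest, Run);
      } else {
        Run = 0;
      }
      Prev = C;
    }
    return std::max<std::size_t>(2, Longest + 1);
  }

For the note text in the example above the helper yields 3, matching the
``{{{`` delimiter used there, and for text containing ``{{{`` it yields 4.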

Regex matching mode may be selected by appending ``-re`` to the diagnostic type
and including regexes wrapped in double curly braces (``{{`` and ``}}``) in the
directive, such as:

.. code-block:: text

  expected-error-re {{format specifies type 'wchar_t **' (aka '{{.+}}')}}

Examples matching the error "variable has incomplete type 'struct s'":

.. code-block:: c++

  // expected-error {{variable has incomplete type 'struct s'}}
  // expected-error {{variable has incomplete type}}
  // expected-error {{{variable has incomplete type}}}
  // expected-error {{{{variable has incomplete type}}}}

  // expected-error-re {{variable has incomplete type 'struct {{.}}'}}
  // expected-error-re {{variable has incomplete type 'struct {{.*}}'}}
  // expected-error-re {{variable has incomplete type 'struct {{(.*)}}'}}
  // expected-error-re {{variable has incomplete type 'struct{{[[:space:]](.*)}}'}}

Feature Test Macros
===================
Clang implements several ways to test whether a feature is supported or not.
Some of these feature tests are standardized, like ``__has_cpp_attribute`` or
``__cpp_lambdas``, while others are Clang extensions, like ``__has_builtin``.
The common theme among all the various feature tests is that they are a utility
to tell users that we think a particular feature is complete. However,
completeness is a difficult property to define because features may still have
lingering bugs, may only work on some targets, etc. We use the following
criteria when deciding whether to expose a feature test macro (or particular
result value for the feature test):

  * Are there known issues where we reject valid code that should be accepted?
  * Are there known issues where we accept invalid code that should be rejected?
  * Are there known crashes, failed assertions, or miscompilations?
  * Are there known issues on a particular relevant target?

If the answer to any of these is "yes", the feature test macro should either
not be defined or there should be very strong rationale for why the issues
should not prevent defining it. Note, it is acceptable to define the feature
test macro on a per-target basis if needed.

When in doubt, being conservative is better than being aggressive. If we don't
claim support for the feature but it does useful things, users can still use it
and provide us with useful feedback on what is missing. But if we claim support
for a feature that has significant bugs, we've eliminated most of the utility
of having a feature testing macro at all because users are then forced to test
what compiler version is in use to get a more accurate answer.

The status reported by the feature test macro should always be reflected in the
language support page for the corresponding feature (`C++
<https://clang.llvm.org/cxx_status.html>`_, `C
<https://clang.llvm.org/c_status.html>`_) if applicable. This page can give
more nuanced information to the user as well, such as claiming partial support
for a feature and specifying details as to what remains to be done.
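
To illustrate why an accurate answer matters, here is a minimal sketch of how
user code typically consumes these feature tests. The names ``countSetBits``
and ``EXAMPLE_NODISCARD`` are hypothetical, and the fallback definitions are an
assumption made so that the sketch also builds with compilers that lack the
function-like tests:

.. code-block:: c++

  // Provide fallbacks so the checks below are well-formed everywhere.
  #ifndef __has_builtin
  #define __has_builtin(x) 0
  #endif
  #ifndef __has_cpp_attribute
  #define __has_cpp_attribute(x) 0
  #endif

  // Standardized test: only spell the attribute if the compiler claims it.
  #if __has_cpp_attribute(nodiscard)
  #define EXAMPLE_NODISCARD [[nodiscard]]
  #else
  #define EXAMPLE_NODISCARD
  #endif

  // Clang extension test: use the builtin when it is claimed, otherwise fall
  // back to a portable loop that clears the lowest set bit on each iteration.
  EXAMPLE_NODISCARD inline int countSetBits(unsigned Value) {
  #if __has_builtin(__builtin_popcount)
    return __builtin_popcount(Value);
  #else
    int Count = 0;
    for (; Value != 0; Value &= Value - 1)
      ++Count;
    return Count;
  #endif
  }

Code like this takes the claimed path without any further checks, so a feature
test that reports support for a buggy feature leaves users with no portable
way to select the fallback.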