======================================================
Kaleidoscope: Conclusion and other useful LLVM tidbits
======================================================

.. contents::
   :local:

Tutorial Conclusion
===================

Welcome to the final chapter of the "`Implementing a language with
LLVM <index.html>`_" tutorial. In the course of this tutorial, we have
grown our little Kaleidoscope language from being a useless toy to
being a semi-interesting (but probably still useless) toy. :)

It is interesting to see how far we've come, and how little code it has
taken. We built the entire lexer, parser, AST, code generator, an
interactive run-loop (with a JIT!), and emitted debug information in
standalone executables - all in under 1000 lines of (non-comment/non-blank)
code.

Our little language supports a couple of interesting features: it
supports user-defined binary and unary operators, it uses JIT
compilation for immediate evaluation, and it supports a few control flow
constructs with SSA construction.

Part of the idea of this tutorial was to show you how easy and fun it
can be to define, build, and play with languages. Building a compiler
need not be a scary or mystical process! Now that you've seen some of
the basics, I strongly encourage you to take the code and hack on it.
For example, try adding:

-  **global variables** - While global variables have questionable value
   in modern software engineering, they are often useful when putting
   together quick little hacks like the Kaleidoscope compiler itself.
   Fortunately, our current setup makes it very easy to add global
   variables: just have value lookup check to see if an unresolved
   variable is in the global variable symbol table before rejecting it.
   To create a new global variable, make an instance of the LLVM
   ``GlobalVariable`` class.
-  **typed variables** - Kaleidoscope currently only supports variables
   of type double. This gives the language a very nice elegance, because
   only supporting one type means that you never have to specify types.
   Different languages have different ways of handling this. The easiest
   way is to require the user to specify types for every variable
   definition, and record the type of the variable in the symbol table
   along with its ``Value*``.
-  **arrays, structs, vectors, etc.** - Once you add types, you can start
   extending the type system in all sorts of interesting ways. Simple
   arrays are very easy and are quite useful for many different
   applications. Adding them is mostly an exercise in learning how the
   LLVM `getelementptr <../../LangRef.html#getelementptr-instruction>`_ instruction
   works: it is so nifty/unconventional, it `has its own
   FAQ <../../GetElementPtr.html>`_!
-  **standard runtime** - Our current language allows the user to access
   arbitrary external functions, and we use it for things like "printd"
   and "putchard". As you extend the language to add higher-level
   constructs, often these constructs make the most sense if they are
   lowered to calls into a language-supplied runtime. For example, if
   you add hash tables to the language, it would probably make sense to
   add the routines to a runtime, instead of inlining them all the way.
-  **memory management** - Currently we can only access the stack in
   Kaleidoscope. It would also be useful to be able to allocate heap
   memory, either with calls to the standard libc malloc/free interface
   or with a garbage collector. If you would like to use garbage
   collection, note that LLVM fully supports `Accurate Garbage
   Collection <../../GarbageCollection.html>`_ including algorithms that
   move objects and need to scan/update the stack.
-  **exception handling support** - LLVM supports generation of `zero-cost
   exceptions <../../ExceptionHandling.html>`_ which interoperate with
   code compiled in other languages. You could also generate code by
   implicitly making every function return an error value and checking
   it. You could also make explicit use of setjmp/longjmp. There are
   many different ways to go here.
-  **object orientation, generics, database access, complex numbers,
   geometric programming, ...** - Really, there is no end of crazy
   features that you can add to the language.
-  **unusual domains** - We've been talking about applying LLVM to a
   domain that many people are interested in: building a compiler for a
   specific language. However, there are many other domains that can use
   compiler technology that are not typically considered. For example,
   LLVM has been used to implement OpenGL graphics acceleration,
   translate C++ code to ActionScript, and many other cute and clever
   things. Maybe you will be the first to JIT compile a regular
   expression interpreter into native code with LLVM?

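
To make the first two suggestions a bit more concrete, here is a minimal
sketch of a lookup that tries the local scope first and falls back to a
global table, with each entry recording a type tag next to its name. It is
plain C rather than the tutorial's C++ so that it stands alone, and every
name in it (``VarType``, ``Symbol``, ``SymbolTable``, ``lookup_variable``)
is made up for illustration. A real front-end would store the LLVM value
alongside the type.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical type tags; a real front-end would keep an LLVM type here. */
    typedef enum { TY_DOUBLE, TY_INT } VarType;

    typedef struct {
        const char *name;
        VarType type;
    } Symbol;

    typedef struct {
        const Symbol *entries;
        size_t count;
    } SymbolTable;

    /* Linear search is enough for a sketch; returns NULL when not found. */
    static const Symbol *find_in(const SymbolTable *table, const char *name) {
        for (size_t i = 0; i < table->count; ++i)
            if (strcmp(table->entries[i].name, name) == 0)
                return &table->entries[i];
        return NULL;
    }

    /* Check the local scope first, then the globals, before giving up. */
    const Symbol *lookup_variable(const SymbolTable *locals,
                                  const SymbolTable *globals,
                                  const char *name) {
        const Symbol *found = find_in(locals, name);
        if (found)
            return found;
        return find_in(globals, name);
    }

A ``NULL`` result is the point at which the front-end would report an
unknown variable name. Because locals are searched first, a local ``x``
shadows a global ``x``.
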
Have fun - try doing something crazy and unusual. Building a language
the way everyone else always has is much less fun than trying something a
little crazy or off the wall and seeing how it turns out. If you get
stuck or want to talk about it, please post on the `LLVM forums
<https://discourse.llvm.org>`_: they have lots of people who are interested
in languages and are often willing to help out.

Before we end this tutorial, I want to talk about some "tips and tricks"
for generating LLVM IR. These are some of the more subtle things that
may not be obvious, but are very useful if you want to take advantage of
LLVM's capabilities.

Properties of the LLVM IR
=========================

We have a couple of common questions about code in the LLVM IR form -
let's just get these out of the way right now, shall we?

Target Independence
-------------------

Kaleidoscope is an example of a "portable language": any program written
in Kaleidoscope will work the same way on any target that it runs on.
Many other languages have this property, e.g. Lisp, Java, Haskell,
JavaScript, Python, etc. (note that while these languages are portable,
not all their libraries are).

One nice aspect of LLVM is that it is often capable of preserving target
independence in the IR: you can take the LLVM IR for a
Kaleidoscope-compiled program and run it on any target that LLVM
supports, even emitting C code and compiling that on targets that LLVM
doesn't support natively. You can trivially tell that the Kaleidoscope
compiler generates target-independent code because it never queries for
any target-specific information when generating code.

The fact that LLVM provides a compact, target-independent
representation for code gets a lot of people excited. Unfortunately,
these people are usually thinking about C or a language from the C
family when they are asking questions about language portability. I say
"unfortunately", because there is really no way to make (fully general)
C code portable, other than shipping the source code around (and of
course, C source code is not actually portable in general either - ever
port a really old application from 32 to 64 bits?).

The problem with C (again, in its full generality) is that it is heavily
laden with target-specific assumptions. As one simple example, the
preprocessor often destructively removes target-independence from the
code when it processes the input text:

.. code-block:: c

    #ifdef __i386__
      int X = 1;
    #else
      int X = 42;
    #endif

While it is possible to engineer more and more complex solutions to
problems like this, it cannot be solved in full generality in a way that
is better than shipping the actual source code.

That said, there are interesting subsets of C that can be made portable.
If you are willing to fix primitive types to a fixed size (say int =
32 bits and long = 64 bits), don't care about ABI compatibility with
existing binaries, and are willing to give up some other minor features,
you can have portable code. This can make sense for specialized domains
such as an in-kernel language.

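
As a rough sketch of what "fixing primitive types" means, the snippet
below pins a hypothetical language's integer types to exact widths using
the C99 ``<stdint.h>`` typedefs. The names ``lang_int``, ``lang_long`` and
``widening_mul`` are invented for this example.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical language types with widths that never vary by target. */
    typedef int32_t lang_int;   /* always 32 bits */
    typedef int64_t lang_long;  /* always 64 bits */

    /* Both operands are widened to 64 bits before multiplying, so the
     * result does not depend on how wide the target's plain int is. */
    lang_long widening_mul(lang_int a, lang_int b) {
        return (lang_long)a * (lang_long)b;
    }

``widening_mul(100000, 100000)`` yields 10000000000 on every target,
whereas the same product computed in plain ``int`` overflows wherever
``int`` is 32 bits wide.
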
Safety Guarantees
-----------------

Many of the languages above are also "safe" languages: it is impossible
for a program written in Java to corrupt its address space and crash the
process (assuming the JVM has no bugs). Safety is an interesting
property that requires a combination of language design, runtime
support, and often operating system support.

It is certainly possible to implement a safe language in LLVM, but LLVM
IR does not itself guarantee safety. The LLVM IR allows unsafe pointer
casts, use-after-free bugs, buffer overruns, and a variety of other
problems. Safety needs to be implemented as a layer on top of LLVM and,
conveniently, several groups have investigated this. Ask on the `LLVM
forums <https://discourse.llvm.org>`_ if you are interested in more details.

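
One simple piece of such a layer is a bounds check in front of every
array access. The sketch below shows, in plain C, the kind of code a safe
language's front-end might emit. ``DoubleSlice`` and ``checked_load`` are
hypothetical names, and a real implementation would more likely trap or
raise an error than return a flag.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* A pointer that carries its own length, so accesses can be checked. */
    typedef struct {
        const double *data;
        size_t length;
    } DoubleSlice;

    /* Store the element in *out and return true when index is in range.
     * Out-of-range accesses return false and leave *out untouched. */
    bool checked_load(DoubleSlice slice, size_t index, double *out) {
        if (index >= slice.length)
            return false;
        *out = slice.data[index];
        return true;
    }

Nothing in the IR forces this check to be there: the guarantee comes from
the front-end emitting it on every access, which is why safety is a
property of the language implementation rather than of LLVM IR.
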
Language-Specific Optimizations
-------------------------------

One thing about LLVM that turns off many people is that it does not
solve all the world's problems in one system.  One specific
complaint is that people perceive LLVM as being incapable of performing
high-level language-specific optimization: LLVM "loses too much
information".  Here are a few observations about this:

First, you're right that LLVM does lose information. For example, as of
this writing, there is no way to distinguish in the LLVM IR whether an
SSA-value came from a C "int" or a C "long" on an ILP32 machine (other
than debug info). Both get compiled down to an 'i32' value and the
information about what it came from is lost. The more general issue
here is that the LLVM type system uses "structural equivalence" instead
of "name equivalence". Another place this surprises people is if you
have two types in a high-level language that have the same structure
(e.g. two different structs that have a single int field): these types
will compile down into a single LLVM type and it will be impossible to
tell what it came from.

Second, while LLVM does lose information, LLVM is not a fixed target: we
continue to enhance and improve it in many different ways. In addition
to adding new features (LLVM did not always support exceptions or debug
info), we also extend the IR to capture important information for
optimization (e.g. whether an argument is sign- or zero-extended,
information about pointers aliasing, etc.). Many of the enhancements are
user-driven: people want LLVM to include some specific feature, so they
go ahead and extend it.

Third, it is *possible and easy* to add language-specific optimizations,
and you have a number of choices in how to do it. As one trivial
example, it is easy to add language-specific optimization passes that
"know" things about code compiled for a language. In the case of the C
family, there is an optimization pass that "knows" about the standard C
library functions. If you call "exit(0)" in main(), it knows that it is
safe to optimize that into "return 0;" because C specifies what the
'exit' function does.

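
To show the shape of such a "knowing" pass without pulling in LLVM itself,
here is a toy version over a made-up three-opcode instruction list. This
is not LLVM's IR or pass API; ``Opcode``, ``Inst`` and
``simplify_exit_calls`` are hypothetical names chosen for the sketch.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* A made-up instruction set, just large enough for the example. */
    typedef enum { OP_CALL_EXIT, OP_RETURN, OP_OTHER } Opcode;

    typedef struct {
        Opcode op;
        int operand; /* exit status or return value */
    } Inst;

    /* Rewrite every "call exit(N)" into "return N". The rewrite relies on
     * language knowledge: it is only valid inside main(), so the caller
     * has to say whether this body is main. Returns the number of
     * instructions rewritten. */
    size_t simplify_exit_calls(Inst *body, size_t count, bool is_main) {
        size_t rewritten = 0;
        if (!is_main)
            return 0;
        for (size_t i = 0; i < count; ++i) {
            if (body[i].op == OP_CALL_EXIT) {
                body[i].op = OP_RETURN;
                ++rewritten;
            }
        }
        return rewritten;
    }

The interesting part is the ``is_main`` test: the rewrite is justified by
what the language says about ``exit`` and ``main``, not by anything the
instructions themselves reveal.
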
In addition to simple library knowledge, it is possible to embed a
variety of other language-specific information into the LLVM IR. If you
have a specific need and run into a wall, please bring the topic up on
the `LLVM forums <https://discourse.llvm.org>`_. At the very worst, you
can always treat LLVM as if it were a "dumb code generator" and implement
the high-level optimizations you desire in your front-end, on the
language-specific AST.

Tips and Tricks
===============

There is a variety of useful tips and tricks that you come to know after
working on/with LLVM that aren't obvious at first glance. Instead of
letting everyone rediscover them, this section talks about some of these
issues.

Implementing portable offsetof/sizeof
-------------------------------------

One interesting thing that comes up, if you are trying to keep the code
generated by your compiler "target independent", is that you often need
to know the size of some LLVM type or the offset of some field in an
LLVM structure. For example, you might need to pass the size of a type
into a function that allocates memory.

Unfortunately, this can vary widely across targets: for example, the
width of a pointer is trivially target-specific. However, there is a
`clever way to use the getelementptr
instruction <http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_
that allows you to compute this in a portable way.

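
Roughly, the idea is to let address arithmetic do the work: the size of a
type is the distance from element 0 to element 1 of an array of that type,
and the offset of a field is the distance from the start of the object to
the field. The IR version indexes from a null base pointer and converts
the resulting address to an integer. The C sketch below shows the same
arithmetic on real objects, which keeps it well-defined C; it is an
analogy rather than the IR itself, and ``struct Pair`` and both function
names are invented here.

.. code-block:: c

    #include <stddef.h>

    /* Example record; padding after tag makes the layout target-dependent. */
    struct Pair {
        char tag;
        double value;
    };

    /* Size of one element: the byte distance between elems[0] and elems[1]. */
    size_t pair_size_by_stride(void) {
        struct Pair elems[2];
        return (size_t)((char *)&elems[1] - (char *)&elems[0]);
    }

    /* Offset of a field: the byte distance from the object to the field. */
    size_t value_offset_by_address(void) {
        struct Pair p;
        return (size_t)((char *)&p.value - (char *)&p);
    }

Both functions agree with ``sizeof(struct Pair)`` and
``offsetof(struct Pair, value)``, yet neither contains a number: the
layout is filled in by whichever target the code is compiled for.
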
Garbage Collected Stack Frames
------------------------------

Some languages want to explicitly manage their stack frames, often so
that they are garbage collected or to allow easy implementation of
closures. There are often better ways to implement these features than
explicit stack frames, but `LLVM does support
them <http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_
if you want. It requires your front-end to convert the code into
`Continuation Passing
Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ and
the use of tail calls (which LLVM also supports).

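
For a feel of what code in continuation passing style looks like, here is
a small factorial in plain C where each pending multiplication lives in a
heap-allocated frame instead of on the machine stack. All names
(``Cont``, ``fact_cps``, ``factorial``) are hypothetical. Frames are freed
by hand here, where a garbage-collected language would leave them to the
collector, and C does not guarantee tail-call elimination, so treat this
as a sketch of the shape rather than of what LLVM generates.

.. code-block:: c

    #include <stdlib.h>

    /* A continuation: the code to resume plus the values it still needs. */
    typedef struct Cont Cont;
    struct Cont {
        long (*resume)(Cont *self, long result);
        long saved_n; /* value kept alive across the "call" */
        Cont *next;   /* what to do after this frame finishes */
    };

    /* Final continuation: just hand the result back. */
    static long finish(Cont *self, long result) {
        (void)self;
        return result;
    }

    /* Multiply by the saved n, release the frame, and continue. */
    static long multiply_then_continue(Cont *self, long result) {
        long n = self->saved_n;
        Cont *next = self->next;
        free(self);
        return next->resume(next, n * result); /* call in tail position */
    }

    static long fact_cps(long n, Cont *k) {
        if (n <= 1)
            return k->resume(k, 1); /* call in tail position */
        Cont *frame = malloc(sizeof *frame);
        if (!frame)
            abort();
        frame->resume = multiply_then_continue;
        frame->saved_n = n;
        frame->next = k;
        return fact_cps(n - 1, frame); /* call in tail position */
    }

    long factorial(long n) {
        Cont done = { finish, 0, NULL };
        return fact_cps(n, &done);
    }

Every call is the last thing its function does, so with real tail calls
nothing needs to stay on the machine stack: the state that must survive
lives in the ``Cont`` frames, which is what makes them collectable.
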