xref: /llvm-project/llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl01.rst (revision 63cf7040814e3b190a3b5a65858d8d59ab47b74d)
1d80f118eSChris Lattner=====================================================
2d80f118eSChris LattnerKaleidoscope: Kaleidoscope Introduction and the Lexer
3d80f118eSChris Lattner=====================================================
4d80f118eSChris Lattner
5d80f118eSChris Lattner.. contents::
6d80f118eSChris Lattner   :local:
7d80f118eSChris Lattner
8d80f118eSChris LattnerThe Kaleidoscope Language
9d80f118eSChris Lattner=========================
10d80f118eSChris Lattner
11*0fa6c158SChris LattnerThis tutorial is illustrated with a toy language called
12d80f118eSChris Lattner"`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (derived
13d80f118eSChris Lattnerfrom "meaning beautiful, form, and view"). Kaleidoscope is a procedural
14d80f118eSChris Lattnerlanguage that allows you to define functions, use conditionals, math,
15d80f118eSChris Lattneretc. Over the course of the tutorial, we'll extend Kaleidoscope to
16d80f118eSChris Lattnersupport the if/then/else construct, a for loop, user defined operators,
17*0fa6c158SChris LattnerJIT compilation with a simple command line interface, debug info, etc.
18d80f118eSChris Lattner
19*0fa6c158SChris LattnerWe want to keep things simple, so the only datatype in Kaleidoscope
20d80f118eSChris Lattneris a 64-bit floating point type (aka 'double' in C parlance). As such,
21d80f118eSChris Lattnerall values are implicitly double precision and the language doesn't
22d80f118eSChris Lattnerrequire type declarations. This gives the language a very nice and
23d80f118eSChris Lattnersimple syntax. For example, the following simple example computes
24d80f118eSChris Lattner`Fibonacci numbers: <http://en.wikipedia.org/wiki/Fibonacci_number>`_
25d80f118eSChris Lattner
26d80f118eSChris Lattner::
27d80f118eSChris Lattner
28d80f118eSChris Lattner    # Compute the x'th fibonacci number.
29d80f118eSChris Lattner    def fib(x)
30d80f118eSChris Lattner      if x < 3 then
31d80f118eSChris Lattner        1
32d80f118eSChris Lattner      else
33d80f118eSChris Lattner        fib(x-1)+fib(x-2)
34d80f118eSChris Lattner
35d80f118eSChris Lattner    # This expression will compute the 40th number.
36d80f118eSChris Lattner    fib(40)
37d80f118eSChris Lattner
38*0fa6c158SChris LattnerWe also allow Kaleidoscope to call into standard library functions - the
39*0fa6c158SChris LattnerLLVM JIT makes this really easy. This means that you can use the
40d80f118eSChris Lattner'extern' keyword to define a function before you use it (this is also
41d80f118eSChris Lattneruseful for mutually recursive functions).  For example:
42d80f118eSChris Lattner
43d80f118eSChris Lattner::
44d80f118eSChris Lattner
45d80f118eSChris Lattner    extern sin(arg);
46d80f118eSChris Lattner    extern cos(arg);
47d80f118eSChris Lattner    extern atan2(arg1 arg2);
48d80f118eSChris Lattner
49d80f118eSChris Lattner    atan2(sin(.4), cos(42))
50d80f118eSChris Lattner
51d80f118eSChris LattnerA more interesting example is included in Chapter 6 where we write a
52d80f118eSChris Lattnerlittle Kaleidoscope application that `displays a Mandelbrot
53d80f118eSChris LattnerSet <LangImpl06.html#kicking-the-tires>`_ at various levels of magnification.
54d80f118eSChris Lattner
55*0fa6c158SChris LattnerLet's dive into the implementation of this language!
56d80f118eSChris Lattner
57d80f118eSChris LattnerThe Lexer
58d80f118eSChris Lattner=========
59d80f118eSChris Lattner
60d80f118eSChris LattnerWhen it comes to implementing a language, the first thing needed is the
61d80f118eSChris Lattnerability to process a text file and recognize what it says. The
62d80f118eSChris Lattnertraditional way to do this is to use a
63d80f118eSChris Lattner"`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka
64d80f118eSChris Lattner'scanner') to break the input up into "tokens". Each token returned by
65d80f118eSChris Lattnerthe lexer includes a token code and potentially some metadata (e.g. the
66d80f118eSChris Lattnernumeric value of a number). First, we define the possibilities:
67d80f118eSChris Lattner
68d80f118eSChris Lattner.. code-block:: c++
69d80f118eSChris Lattner
70d80f118eSChris Lattner    // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
71d80f118eSChris Lattner    // of these for known things.
72d80f118eSChris Lattner    enum Token {
73d80f118eSChris Lattner      tok_eof = -1,
74d80f118eSChris Lattner
75d80f118eSChris Lattner      // commands
76d80f118eSChris Lattner      tok_def = -2,
77d80f118eSChris Lattner      tok_extern = -3,
78d80f118eSChris Lattner
79d80f118eSChris Lattner      // primary
80d80f118eSChris Lattner      tok_identifier = -4,
81d80f118eSChris Lattner      tok_number = -5,
82d80f118eSChris Lattner    };
83d80f118eSChris Lattner
84d80f118eSChris Lattner    static std::string IdentifierStr; // Filled in if tok_identifier
85d80f118eSChris Lattner    static double NumVal;             // Filled in if tok_number
86d80f118eSChris Lattner
87d80f118eSChris LattnerEach token returned by our lexer will either be one of the Token enum
88d80f118eSChris Lattnervalues or it will be an 'unknown' character like '+', which is returned
89d80f118eSChris Lattneras its ASCII value. If the current token is an identifier, the
90d80f118eSChris Lattner``IdentifierStr`` global variable holds the name of the identifier. If
91d80f118eSChris Lattnerthe current token is a numeric literal (like 1.0), ``NumVal`` holds its
92*0fa6c158SChris Lattnervalue. We use global variables for simplicity, but this is not the
93d80f118eSChris Lattnerbest choice for a real language implementation :).
94d80f118eSChris Lattner
95d80f118eSChris LattnerThe actual implementation of the lexer is a single function named
96d80f118eSChris Lattner``gettok``. The ``gettok`` function is called to return the next token
97d80f118eSChris Lattnerfrom standard input. Its definition starts as:
98d80f118eSChris Lattner
99d80f118eSChris Lattner.. code-block:: c++
100d80f118eSChris Lattner
101d80f118eSChris Lattner    /// gettok - Return the next token from standard input.
102d80f118eSChris Lattner    static int gettok() {
103d80f118eSChris Lattner      static int LastChar = ' ';
104d80f118eSChris Lattner
105d80f118eSChris Lattner      // Skip any whitespace.
106d80f118eSChris Lattner      while (isspace(LastChar))
107d80f118eSChris Lattner        LastChar = getchar();
108d80f118eSChris Lattner
109d80f118eSChris Lattner``gettok`` works by calling the C ``getchar()`` function to read
110d80f118eSChris Lattnercharacters one at a time from standard input. It eats them as it
111d80f118eSChris Lattnerrecognizes them and stores the last character read, but not processed,
112d80f118eSChris Lattnerin LastChar. The first thing that it has to do is ignore whitespace
113d80f118eSChris Lattnerbetween tokens. This is accomplished with the loop above.
114d80f118eSChris Lattner
115d80f118eSChris LattnerThe next thing ``gettok`` needs to do is recognize identifiers and
116d80f118eSChris Lattnerspecific keywords like "def". Kaleidoscope does this with this simple
117d80f118eSChris Lattnerloop:
118d80f118eSChris Lattner
119d80f118eSChris Lattner.. code-block:: c++
120d80f118eSChris Lattner
121d80f118eSChris Lattner      if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
122d80f118eSChris Lattner        IdentifierStr = LastChar;
123d80f118eSChris Lattner        while (isalnum((LastChar = getchar())))
124d80f118eSChris Lattner          IdentifierStr += LastChar;
125d80f118eSChris Lattner
126d80f118eSChris Lattner        if (IdentifierStr == "def")
127d80f118eSChris Lattner          return tok_def;
128d80f118eSChris Lattner        if (IdentifierStr == "extern")
129d80f118eSChris Lattner          return tok_extern;
130d80f118eSChris Lattner        return tok_identifier;
131d80f118eSChris Lattner      }
132d80f118eSChris Lattner
133d80f118eSChris LattnerNote that this code sets the '``IdentifierStr``' global whenever it
134d80f118eSChris Lattnerlexes an identifier. Also, since language keywords are matched by the
135d80f118eSChris Lattnersame loop, we handle them here inline. Numeric values are similar:
136d80f118eSChris Lattner
137d80f118eSChris Lattner.. code-block:: c++
138d80f118eSChris Lattner
139d80f118eSChris Lattner      if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
140d80f118eSChris Lattner        std::string NumStr;
141d80f118eSChris Lattner        do {
142d80f118eSChris Lattner          NumStr += LastChar;
143d80f118eSChris Lattner          LastChar = getchar();
144d80f118eSChris Lattner        } while (isdigit(LastChar) || LastChar == '.');
145d80f118eSChris Lattner
146d80f118eSChris Lattner        NumVal = strtod(NumStr.c_str(), 0);
147d80f118eSChris Lattner        return tok_number;
148d80f118eSChris Lattner      }
149d80f118eSChris Lattner
150*0fa6c158SChris LattnerThis is all pretty straightforward code for processing input. When
151d80f118eSChris Lattnerreading a numeric value from input, we use the C ``strtod`` function to
152d80f118eSChris Lattnerconvert it to a numeric value that we store in ``NumVal``. Note that
153d80f118eSChris Lattnerthis isn't doing sufficient error checking: it will incorrectly read
154d80f118eSChris Lattner"1.23.45.67" and handle it as if you typed in "1.23". Feel free to
155*0fa6c158SChris Lattnerextend it!  Next we handle comments:
156d80f118eSChris Lattner
157d80f118eSChris Lattner.. code-block:: c++
158d80f118eSChris Lattner
159d80f118eSChris Lattner      if (LastChar == '#') {
160d80f118eSChris Lattner        // Comment until end of line.
161d80f118eSChris Lattner        do
162d80f118eSChris Lattner          LastChar = getchar();
163d80f118eSChris Lattner        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
164d80f118eSChris Lattner
165d80f118eSChris Lattner        if (LastChar != EOF)
166d80f118eSChris Lattner          return gettok();
167d80f118eSChris Lattner      }
168d80f118eSChris Lattner
169d80f118eSChris LattnerWe handle comments by skipping to the end of the line and then return
170d80f118eSChris Lattnerthe next token. Finally, if the input doesn't match one of the above
171d80f118eSChris Lattnercases, it is either an operator character like '+' or the end of the
172d80f118eSChris Lattnerfile. These are handled with this code:
173d80f118eSChris Lattner
174d80f118eSChris Lattner.. code-block:: c++
175d80f118eSChris Lattner
176d80f118eSChris Lattner      // Check for end of file.  Don't eat the EOF.
177d80f118eSChris Lattner      if (LastChar == EOF)
178d80f118eSChris Lattner        return tok_eof;
179d80f118eSChris Lattner
180d80f118eSChris Lattner      // Otherwise, just return the character as its ascii value.
181d80f118eSChris Lattner      int ThisChar = LastChar;
182d80f118eSChris Lattner      LastChar = getchar();
183d80f118eSChris Lattner      return ThisChar;
184d80f118eSChris Lattner    }
185d80f118eSChris Lattner
186d80f118eSChris LattnerWith this, we have the complete lexer for the basic Kaleidoscope
187d80f118eSChris Lattnerlanguage (the `full code listing <LangImpl02.html#full-code-listing>`_ for the Lexer
188d80f118eSChris Lattneris available in the `next chapter <LangImpl02.html>`_ of the tutorial).
189d80f118eSChris LattnerNext we'll `build a simple parser that uses this to build an Abstract
190d80f118eSChris LattnerSyntax Tree <LangImpl02.html>`_. When we have that, we'll include a
191d80f118eSChris Lattnerdriver so that you can use the lexer and parser together.
192d80f118eSChris Lattner
193d80f118eSChris Lattner`Next: Implementing a Parser and AST <LangImpl02.html>`_
194d80f118eSChris Lattner
195