1d80f118eSChris Lattner===================================================== 2d80f118eSChris LattnerKaleidoscope: Kaleidoscope Introduction and the Lexer 3d80f118eSChris Lattner===================================================== 4d80f118eSChris Lattner 5d80f118eSChris Lattner.. contents:: 6d80f118eSChris Lattner :local: 7d80f118eSChris Lattner 8d80f118eSChris LattnerThe Kaleidoscope Language 9d80f118eSChris Lattner========================= 10d80f118eSChris Lattner 11*0fa6c158SChris LattnerThis tutorial is illustrated with a toy language called 12d80f118eSChris Lattner"`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (derived 13d80f118eSChris Lattnerfrom "meaning beautiful, form, and view"). Kaleidoscope is a procedural 14d80f118eSChris Lattnerlanguage that allows you to define functions, use conditionals, math, 15d80f118eSChris Lattneretc. Over the course of the tutorial, we'll extend Kaleidoscope to 16d80f118eSChris Lattnersupport the if/then/else construct, a for loop, user defined operators, 17*0fa6c158SChris LattnerJIT compilation with a simple command line interface, debug info, etc. 18d80f118eSChris Lattner 19*0fa6c158SChris LattnerWe want to keep things simple, so the only datatype in Kaleidoscope 20d80f118eSChris Lattneris a 64-bit floating point type (aka 'double' in C parlance). As such, 21d80f118eSChris Lattnerall values are implicitly double precision and the language doesn't 22d80f118eSChris Lattnerrequire type declarations. This gives the language a very nice and 23d80f118eSChris Lattnersimple syntax. For example, the following simple example computes 24d80f118eSChris Lattner`Fibonacci numbers: <http://en.wikipedia.org/wiki/Fibonacci_number>`_ 25d80f118eSChris Lattner 26d80f118eSChris Lattner:: 27d80f118eSChris Lattner 28d80f118eSChris Lattner # Compute the x'th fibonacci number. 29d80f118eSChris Lattner def fib(x) 30d80f118eSChris Lattner if x < 3 then 31d80f118eSChris Lattner 1 32d80f118eSChris Lattner else 33d80f118eSChris Lattner fib(x-1)+fib(x-2) 34d80f118eSChris Lattner 35d80f118eSChris Lattner # This expression will compute the 40th number. 36d80f118eSChris Lattner fib(40) 37d80f118eSChris Lattner 38*0fa6c158SChris LattnerWe also allow Kaleidoscope to call into standard library functions - the 39*0fa6c158SChris LattnerLLVM JIT makes this really easy. This means that you can use the 40d80f118eSChris Lattner'extern' keyword to define a function before you use it (this is also 41d80f118eSChris Lattneruseful for mutually recursive functions). For example: 42d80f118eSChris Lattner 43d80f118eSChris Lattner:: 44d80f118eSChris Lattner 45d80f118eSChris Lattner extern sin(arg); 46d80f118eSChris Lattner extern cos(arg); 47d80f118eSChris Lattner extern atan2(arg1 arg2); 48d80f118eSChris Lattner 49d80f118eSChris Lattner atan2(sin(.4), cos(42)) 50d80f118eSChris Lattner 51d80f118eSChris LattnerA more interesting example is included in Chapter 6 where we write a 52d80f118eSChris Lattnerlittle Kaleidoscope application that `displays a Mandelbrot 53d80f118eSChris LattnerSet <LangImpl06.html#kicking-the-tires>`_ at various levels of magnification. 54d80f118eSChris Lattner 55*0fa6c158SChris LattnerLet's dive into the implementation of this language! 56d80f118eSChris Lattner 57d80f118eSChris LattnerThe Lexer 58d80f118eSChris Lattner========= 59d80f118eSChris Lattner 60d80f118eSChris LattnerWhen it comes to implementing a language, the first thing needed is the 61d80f118eSChris Lattnerability to process a text file and recognize what it says. The 62d80f118eSChris Lattnertraditional way to do this is to use a 63d80f118eSChris Lattner"`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka 64d80f118eSChris Lattner'scanner') to break the input up into "tokens". Each token returned by 65d80f118eSChris Lattnerthe lexer includes a token code and potentially some metadata (e.g. the 66d80f118eSChris Lattnernumeric value of a number). First, we define the possibilities: 67d80f118eSChris Lattner 68d80f118eSChris Lattner.. code-block:: c++ 69d80f118eSChris Lattner 70d80f118eSChris Lattner // The lexer returns tokens [0-255] if it is an unknown character, otherwise one 71d80f118eSChris Lattner // of these for known things. 72d80f118eSChris Lattner enum Token { 73d80f118eSChris Lattner tok_eof = -1, 74d80f118eSChris Lattner 75d80f118eSChris Lattner // commands 76d80f118eSChris Lattner tok_def = -2, 77d80f118eSChris Lattner tok_extern = -3, 78d80f118eSChris Lattner 79d80f118eSChris Lattner // primary 80d80f118eSChris Lattner tok_identifier = -4, 81d80f118eSChris Lattner tok_number = -5, 82d80f118eSChris Lattner }; 83d80f118eSChris Lattner 84d80f118eSChris Lattner static std::string IdentifierStr; // Filled in if tok_identifier 85d80f118eSChris Lattner static double NumVal; // Filled in if tok_number 86d80f118eSChris Lattner 87d80f118eSChris LattnerEach token returned by our lexer will either be one of the Token enum 88d80f118eSChris Lattnervalues or it will be an 'unknown' character like '+', which is returned 89d80f118eSChris Lattneras its ASCII value. If the current token is an identifier, the 90d80f118eSChris Lattner``IdentifierStr`` global variable holds the name of the identifier. If 91d80f118eSChris Lattnerthe current token is a numeric literal (like 1.0), ``NumVal`` holds its 92*0fa6c158SChris Lattnervalue. We use global variables for simplicity, but this is not the 93d80f118eSChris Lattnerbest choice for a real language implementation :). 94d80f118eSChris Lattner 95d80f118eSChris LattnerThe actual implementation of the lexer is a single function named 96d80f118eSChris Lattner``gettok``. The ``gettok`` function is called to return the next token 97d80f118eSChris Lattnerfrom standard input. Its definition starts as: 98d80f118eSChris Lattner 99d80f118eSChris Lattner.. code-block:: c++ 100d80f118eSChris Lattner 101d80f118eSChris Lattner /// gettok - Return the next token from standard input. 102d80f118eSChris Lattner static int gettok() { 103d80f118eSChris Lattner static int LastChar = ' '; 104d80f118eSChris Lattner 105d80f118eSChris Lattner // Skip any whitespace. 106d80f118eSChris Lattner while (isspace(LastChar)) 107d80f118eSChris Lattner LastChar = getchar(); 108d80f118eSChris Lattner 109d80f118eSChris Lattner``gettok`` works by calling the C ``getchar()`` function to read 110d80f118eSChris Lattnercharacters one at a time from standard input. It eats them as it 111d80f118eSChris Lattnerrecognizes them and stores the last character read, but not processed, 112d80f118eSChris Lattnerin LastChar. The first thing that it has to do is ignore whitespace 113d80f118eSChris Lattnerbetween tokens. This is accomplished with the loop above. 114d80f118eSChris Lattner 115d80f118eSChris LattnerThe next thing ``gettok`` needs to do is recognize identifiers and 116d80f118eSChris Lattnerspecific keywords like "def". Kaleidoscope does this with this simple 117d80f118eSChris Lattnerloop: 118d80f118eSChris Lattner 119d80f118eSChris Lattner.. code-block:: c++ 120d80f118eSChris Lattner 121d80f118eSChris Lattner if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* 122d80f118eSChris Lattner IdentifierStr = LastChar; 123d80f118eSChris Lattner while (isalnum((LastChar = getchar()))) 124d80f118eSChris Lattner IdentifierStr += LastChar; 125d80f118eSChris Lattner 126d80f118eSChris Lattner if (IdentifierStr == "def") 127d80f118eSChris Lattner return tok_def; 128d80f118eSChris Lattner if (IdentifierStr == "extern") 129d80f118eSChris Lattner return tok_extern; 130d80f118eSChris Lattner return tok_identifier; 131d80f118eSChris Lattner } 132d80f118eSChris Lattner 133d80f118eSChris LattnerNote that this code sets the '``IdentifierStr``' global whenever it 134d80f118eSChris Lattnerlexes an identifier. Also, since language keywords are matched by the 135d80f118eSChris Lattnersame loop, we handle them here inline. Numeric values are similar: 136d80f118eSChris Lattner 137d80f118eSChris Lattner.. code-block:: c++ 138d80f118eSChris Lattner 139d80f118eSChris Lattner if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ 140d80f118eSChris Lattner std::string NumStr; 141d80f118eSChris Lattner do { 142d80f118eSChris Lattner NumStr += LastChar; 143d80f118eSChris Lattner LastChar = getchar(); 144d80f118eSChris Lattner } while (isdigit(LastChar) || LastChar == '.'); 145d80f118eSChris Lattner 146d80f118eSChris Lattner NumVal = strtod(NumStr.c_str(), 0); 147d80f118eSChris Lattner return tok_number; 148d80f118eSChris Lattner } 149d80f118eSChris Lattner 150*0fa6c158SChris LattnerThis is all pretty straightforward code for processing input. When 151d80f118eSChris Lattnerreading a numeric value from input, we use the C ``strtod`` function to 152d80f118eSChris Lattnerconvert it to a numeric value that we store in ``NumVal``. Note that 153d80f118eSChris Lattnerthis isn't doing sufficient error checking: it will incorrectly read 154d80f118eSChris Lattner"1.23.45.67" and handle it as if you typed in "1.23". Feel free to 155*0fa6c158SChris Lattnerextend it! Next we handle comments: 156d80f118eSChris Lattner 157d80f118eSChris Lattner.. code-block:: c++ 158d80f118eSChris Lattner 159d80f118eSChris Lattner if (LastChar == '#') { 160d80f118eSChris Lattner // Comment until end of line. 161d80f118eSChris Lattner do 162d80f118eSChris Lattner LastChar = getchar(); 163d80f118eSChris Lattner while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); 164d80f118eSChris Lattner 165d80f118eSChris Lattner if (LastChar != EOF) 166d80f118eSChris Lattner return gettok(); 167d80f118eSChris Lattner } 168d80f118eSChris Lattner 169d80f118eSChris LattnerWe handle comments by skipping to the end of the line and then return 170d80f118eSChris Lattnerthe next token. Finally, if the input doesn't match one of the above 171d80f118eSChris Lattnercases, it is either an operator character like '+' or the end of the 172d80f118eSChris Lattnerfile. These are handled with this code: 173d80f118eSChris Lattner 174d80f118eSChris Lattner.. code-block:: c++ 175d80f118eSChris Lattner 176d80f118eSChris Lattner // Check for end of file. Don't eat the EOF. 177d80f118eSChris Lattner if (LastChar == EOF) 178d80f118eSChris Lattner return tok_eof; 179d80f118eSChris Lattner 180d80f118eSChris Lattner // Otherwise, just return the character as its ascii value. 181d80f118eSChris Lattner int ThisChar = LastChar; 182d80f118eSChris Lattner LastChar = getchar(); 183d80f118eSChris Lattner return ThisChar; 184d80f118eSChris Lattner } 185d80f118eSChris Lattner 186d80f118eSChris LattnerWith this, we have the complete lexer for the basic Kaleidoscope 187d80f118eSChris Lattnerlanguage (the `full code listing <LangImpl02.html#full-code-listing>`_ for the Lexer 188d80f118eSChris Lattneris available in the `next chapter <LangImpl02.html>`_ of the tutorial). 189d80f118eSChris LattnerNext we'll `build a simple parser that uses this to build an Abstract 190d80f118eSChris LattnerSyntax Tree <LangImpl02.html>`_. When we have that, we'll include a 191d80f118eSChris Lattnerdriver so that you can use the lexer and parser together. 192d80f118eSChris Lattner 193d80f118eSChris Lattner`Next: Implementing a Parser and AST <LangImpl02.html>`_ 194d80f118eSChris Lattner 195