Welcome to the second article on the topic of Automata, Virtual Machines and Compilers. This time around I'm going to show and explain how to extend the original source code in the first article I wrote on this by adding variables into it, and of course operators bound to them.
Please read the first article if you have not already, otherwise this one may not make much sense for you. As for the required knowledge, understanding the previous article and topics in it is most likely enough.
Definitions
Before I jump into code examples and pseudo-code, I should introduce what is added and how it improves the language defined in previous article. While the previous article defined simple calculator, by adding variables the user is allowed to also use memory in a persistent way during the program execution, not only as temporary space for intermediate results during the computation.
Grammar
Each time somebody wants to do an extension to existing language, a grammar has to be extended. In this case the grammar used in previous article is the starting line. The grammar defined just a simple calculator:
<integer> ::= [0..9]+ <add_operation> ::= + | - <mul_operation> ::= * | / <factor> ::= (<expression>) | <integer> <term> ::= <factor> [<mul_operation> <factor>]* <expression> ::= <term> [<add_operation> <term>]*
To extend previous grammar with variables, some more expressions need to be added:
- Identifier - each language allows variables to be declared or used using variable name. It is up to the grammar to solve this problem, therefore an identifier expression needs to be added. I did so using rule:
<ident> ::= [A..z _][A..z 0..1 _]*
Such rule allows for variables beginning with alphabetic character (both lower or upper case) or underscore, followed by any alphanumeric character or underscore. - Assignment operator - there has to be a way to assign result of an expression into a variable. Therefore and assignment operator is introduced. I used simple '=' symbol, although you are free to use others (another common is ':=')
- Type - a variable has to be declared. I decided to go C-like way of variable declaration, which means a type of variable has to be put first. Even though the language here does not support different types as of now, I decided to go with standard word 'int' for this one.
- Statement exit (Punctuation mark) - adding variables without ability to use multiple statements in the language would be boring. For separating the statements a statement exit rule (also denoted as punctuation mark) was introduced - semicolon ';'
These are the bottom level rules, e.g. the rules that are directly resolved to sequence of characters. By adding those we also have to update the rest (e.g. the rules that are resolved to other rules) and add some new rules. Going from top level:
- Statement - the top level one, not anything exciting about this as the rule is straight forward:
<stmt> ::= <expr><punct>
E.g. and expression followed by semicolon. - Expression - in exchange from the previous grammar the addition rule was renamed, the expression in new grammar resolves to either declaration or assignment rule. E.g.:
<expr> ::= <decl> | <assign>
- Declaration - a declaration rule always declares one variable. There can be only one declaration in the statement (compared to C where declaration can be sequenced using sequence operator), always beginning with type rule, followed by identifier rule. And also it has to be possible to initialize the declared variable:
<decl> ::= <type><ident> [<assign_op> <assign>]*
- Assignment - most likely the most complex rule so far. The assignment itself might not sound complex, yet there are two problems around it (both are also related to previous rule). Starting with example for the first problem:
x = 3;y = x = 7;y = x + 7;y = x + 7 = 3;
Let us go one by one. In the first case, it is straight-forward - the value 3 is stored in x. In the second case, the value 7 is stored in both: x and y. In the third case, logically one would state that the value of x is added to 7 and stored in y. Wrong! How does the compiler know whether it is going to write value into x or read from it? And what should even happen in the last case?To overcome this, a l-value principle has to be introduced. The value on the right side (when on the right we read) is written into the left side. The only exception is rule number 2, where it is possible to conside both: x and y, to be l-values. The fourth example is therefore invalid. Such rule is going to be:<assign> ::= <ident> [<assign_op> <ident>]* [<assign_op> <sub>]^ | <sub>
So that Assignment rule is either a Sub rule (which is our former expression - e.g. arithmetic expression), or it contains identifiers in form of l-value (linked with assignment operator) with or without one assignment of arithmetic expression in the end. The '^' symbol stands for 'zero-or-one'-times. - Sub rule - is a former arithmetic rule, other than name itself, nothing has changed for it.
- Factor - while Term rule stayed, the Factor rule is extended. Inside parentheses and Assignment is allowed (e.g. there can be l-value again in parenthesis with expression on right side of assignment operator), otherwise a Factor is and integer or an identifier
This concludes all the changes to the grammar. I recommend picking a paper (or writing short script) and attemping to generate some programs using this grammar. Actually the grammar itself does not generate programs unless you add one more rule that allows generating zero or more statements, e.g.:
<program> ::= [<stmt>]*This way, the grammar is able to generate various applications. For the convenience I'm also attaching full grammar here:
<program> ::= [<stmt>]* <stmt> ::= <expr><punct> <expr> ::= <decl> | <assign> <decl> ::= <type><ident> [<assign_op> <assign>]^ <assign> ::= <ident> [<assign_op> <ident>]* [<assign_op> <sub>]^ | <sub> <sub> ::= <term> [<add_op> <term>]* <term> ::= <factor> [<mul_op> <factor>]* <factor> ::= (<assign>) | <ident> | <integer> <punct> ::= ; <type> ::= 'int' <ident> ::= [A..z _][A..z 0..1 _]* <assign_op> ::= = <add_op> ::= + | - <mul_op> ::= * | / <integer> ::= [0..9]+
Assignment - Second problem
Did I mention Assignment (and also Declaration) have two problems? By introducing the l-value, it is possible to determine whether that variable is going to be read, or written into. But what about the second one?
The second problem is compiler-related, as before actually setting the value into variable the right part of the rule has to be processed (most of the operators can be solved left-to-right, while assignment is to be solved right-to-left). It is not a big deal, as the actual code for instatiation can be printed into assembly after the compiler processed the rest of the rule.
Compilation
Variable Handling
Before going forward describing some of the changes I did (you can see all the changes in the source code accompanying the article), a discussion about where to store variables is necessary. The obvious answer here should be "in memory", which is partially correct, followed by a question: Where in a memory do we store variables?
The program itself splits memory to two blocks once executed - the block where instructions are stored, where the variables can't be stored, and the block where stack (temporary) variables are stored. Of course, all permanent variables should be stored where temporary variables are.
Temporary variables on the stack are always pushed during the arithmetic computation, but once that is finished all of them are poped up. Therefore all the variables can be stored there beginning with offset equal to 0. Following pseudo-code shows handling of identifier:
Ident(boolean declare, boolean lvalue) { // If the identifier is declared (which means it is preceeded by 'type' token) if (declare is true) { // Store variable in a dictionary (key is variable name, value is offset from base stack address) // GetIdent() returns us name of written variable there // stackOffset is value initialized to 0 in the beginning variablesDictionary.insert(key = GetIdent(), value = stackOffset)); // All we need to do is push value in first register, as first register always store value after // computation printLine("push.i32 r0"); // And offset stack offset (so we can store more variables) stackOffset += 4; } // When we are not doing the declaration else { // If the variable is l-value, the value is stored in it (and not read), although we still need // to keep its value in first register (in case there is statment like 'x = y = 1;') if (lvalue is true) { string ident = GetIdent(); // If the variable name is not inside variable dictionary, there is an undeclared identifier // error. if (variablesDictionary.contains(ident) is false) { Expected("Undeclared Identiefier"); } // Print the assembly command, note that '[' and ']' are used to determine address. The // register SP is used as reference and and offset is added. // Due to push/pop instructions, the actual offset at the moment can be different, determining // correct offset and updating that value is left to disassembly stage printLine("mov.mem.reg.i32 [sp+" + variablesDictionary.getValueAtKey(ident) + "] r0 "); } else { // Actually almost the same as previous function, but different assembly command (not to store, // but to read value from memory into register) string ident = GetIdent(); if (variablesDictionary.contains(ident) is false) { Expected("Undeclared Identiefier"); } printLine("mov.reg.mem.i32 r0 [sp+" + variablesDictionary.getValueAtKey(ident) + "]"); } } }
As mentioned in example code, the actual offset from stack pointer at the moment can differ (due to using variables in middle of computation), although when going through assembly it is possible to count the offset and update it (these values are going to differ only during the single statement computation - therefore this approach can be also used when introducing more complex control flow and function calling). This task is left to be performed by disassembler.
Multiple statements and statements
Implementing multiple statements in a row, and statements themselves is straight forward. For the sake of completness the pseudo-code for those is added:
// Statement itself compiles single expression and always matches punctuation symbol (semicolon) // in the end. Statement() { Expression(); Match(PUNCTUATION_SYMBOL); } // Base compilation function, as long as there are statements in source code, it compiles those. Program() { while (Look()) { Statement(); } }
Right-to-Left
One of the mentioned problems was execution order. While in most cases (addition-subtraction and multiplication-division) the used way to perform the computation is left-to-right, in case of assignment it is right-to-left.
While this problem seem hard to solve, let me demonstrate an example. Assuming we have a code:
x = 4 + 7;
Using the left-to-right compilation, the produced assembly would be:
mov.mem.reg.i32 [sp+0] r0 mov.reg.i32 r0 4 push.i32 r0 mov.reg.i32 r0 7 pop.i32 r1 add.i32 r0 r1
Where the first line is generated by assignment (assuming variable x is at base stack address + 0 byte offset), the rest is generated by the right side of statement. If we want to switch the order - what about putting the left side into buffer and printing it after the right side has been resolved and printed? That works, of course using one buffer is not enough, but LIFO stack of such buffers is required (due to multiple assignments in single statement).
The actual resolution in the code might be confusing if put here in form of pseudo code, the implementation can be seen in reference implementation done in C++. In the following code example the impact of such change is shown:
// Statement: x = (y = 4) + 7; // Variable locations: // x -> [sp+0] // y -> [sp+4] // Which equals to: '<ident="x"> <assign> <paren_l> <ident="y"> <assign> <integer="4"> <paren_r> <add> <integer="7"> <punct>' // First the expression in parentheses is solved: '<ident="y"> <assign> <integer="4">' buffer -> 'mov.mem.reg.i32 [sp+4] r0' assembly -> 'mov.reg.i32 r0 4' // Printing the buffer after what is in assembly yields: mov.reg.i32 r0 4 mov.mem.reg.i32 [sp+4] r0 // The expression outside is solved next to: buffer -> 'mov.mem.reg.i32 [sp+0] r0' // Assignment into buffer (as it is l-value) assembly -> 'push.i32 r0' // Left operand of addition is pushed (result of parenthesis expression) assembly -> 'mov.reg.i32 r0 7' // Move second argument to register 0 assembly -> 'pop.i32 r1' // Pop from stack to register 1 assembly -> 'add.i32 r0 r1' // Add those together // Left side is in buffer, Right side in assembly, printing the buffer after it yields (with previous code): 'mov.reg.i32 r0 4' 'mov.mem.reg.i32 [sp+4] r0' 'push.i32 r0' 'mov.reg.i32 r0 7' 'pop.i32 r1' 'add.i32 r0 r1' 'mov.mem.reg.i32 [sp+0] r0'
Which is correct assembly for given example. Obviously after executing this code, we end up with value of 4 in variable y, and value of 11 in variable x - which is correct status after processing given statement.
Side note: As mentioned in one of the comments, generating an abstract syntax tree (which would be tree of structures), and then allowing to process left-then-right or right-then-left subtree is actually equivalent to this, what is actually done is traversing such abstract syntax tree (represented by the containers in the application), and switching execution order using stack on given sub-tree.
Disassembly
While disassembler in previous article was doing just replacement of instructions with numerical value, and possibly storing arguments as another values - variables complicate it a little bit. First of all it is required to add functions to parse register and offset put in into it - although that is straight forward. What is more interesting is calculating the actual offset. Imagine the following example:
int x = 3; int y = 5; x = 4 + y * 3;
The assembly generated by it is:
// Variable locations: // x -> [sp+0] // y -> [sp+4] 'mov.reg.i32 r0 3' 'push.i32 r0 ' 'mov.reg.i32 r0 5' 'push.i32 r0 ' 'mov.reg.i32 r0 4' 'push.i32 r0' 'mov.reg.mem.i32 r0 [sp+4]' 'push.i32 r0' 'mov.reg.i32 r0 3' 'pop.i32 r1' 'mul.i32 r0 r1' 'pop.i32 r1' 'add.i32 r0 r1' 'mov.mem.reg.i32 [sp+0] r0 '
The line where we are reading from variable y is stating to read from [sp+4], basically stack base position incremented by 4 bytes. The problem is, that stack base position is not known during the runtime - only the stack pointer. Stack pointer is offset each push and pop (either increased or decreased, as the operations are only on 32-bit integers, it is increased/decreased always by 4 bytes), therefore a variable counting this offset has to be introduced, allowing to compensate for this.
Note that this is a way how local variables can be stored. On x86 architecture a base pointer also exists (and in future articles on this topic, I would like to introduce procedures and their calling - which would lead to introducing base pointer register), such pointer points to beginning of current stack frame.
Virtual machine
A virtual machine extension reflects modifications done on disassebler. As these two new instructions are introduced:
'mov.mem.reg.i32' 'mov.reg.mem.i32'
Their respective handling is introduced. Their arguments are:
mov.mem.reg.i32
Moves value stored in register into memory at specified address.
- Integer representing register (pointer to memory where 0x0 is start of memory for application) (32-bit integer type)
- Integer Offset from that register (32-bit integer type)
- Integer representing register (32-bit integer type)
mov.reg.mem.i32
- Integer representing register (32-bit integer type)
- Integer representing register (pointer to memory where 0x0 is start of memory for application) (32-bit integer type)
- Integer Offset from that register (32-bit integer type)
For better understanding of virtual machine executing the code, I recommend downloading attached source and looking at Main.cpp file.
Conclusion
While the language supports variable declaration at this point, there is still one more thing missing for making the language full featured - a way to control the flow of the application. Loops and branches - also known as control structures. Adding those would make the language full featured, which means capable of creating literally any application.
As this sounds good, making the language usable needs more steps to take - while the features my vary on personal preference, in general I assume the list is going to look like this (also can be taken as approximate schema for following articles):
- Control structures (branching, loops)
- Procedures and their calling
- Floating point types
- pointer types
- Global variables
- Communicating with game engine
- Linking
- Custom data types (structures)
Also, if you made it here, thank you for reading and don't forget to check out accompanying source and executable (x64, Windows binary + Visual Studio 2015 project).
24 Sep 2016: Ready for Peer Review
24 Sep 2016: Initial release