Compiler news

1,900 word wall-of-text incoming…

I returned to CoolBasic project after three months of learning XNA 3D game programming, and have been continuing to develop the CoolBasic Classic compiler for the last two weeks. Although the main focus has been with the compiler I also designed a new look & feel for the web portal, and it’s being reviewed by the rest of the Team. This entry, however, will be all about the compiler and what’s been done recently.

Can it produce byte code yet? Yes, to some extent. The underlying mechanics are in place, but nothing is written to any file yet. Basically, it can process expressions. This includes (but is not limited to) calculating the result value of a constant expression, resolving names (i.e. mapping the identifier names to their declared symbols), function overload resolution, automatic type conversions, and filling in the missing arguments for optional function parameters. A lot of thought has been put to optimize performance here. In fact, all those things mentioned above are done in a single iteration over the postfix presentation of the expression. I had to inspect how the .NET framework internally works with arrays, stacks, lists, and type conversion to make sure my algorithms were efficient enough. I actually ended up writing a few own implementations for those highly specialized cases I needed (where the framework equivalent would allocate memory in an inefficient way or do more work than needed, for example).

I’m about to go technical…

Expression pre-evaluation
Where I left off before my XNA experimentations, was the constant expression pre-evaluation. This is when the actual expression processor needs to be implemented. You first convert the expression into postfix notation and then evaluate it. Naturally constant expressions can contain constant symbols as well, so a circular reference safety check exists. Identifier names are resolved during evaluation because we need to know the data type of each value in order to determine the final data type and the ultimate value in case of a constant expression.

The expression processor is a single method any statement or symbol processor can call. Therefore we don’t know whether the expression is a constant expression or not (i.e. whether its value can be fully pre-calculated). When a symbol is encountered the first time, it will be “processed” (as in a constant symbol’s value needs to be known before it can be used in other expressions). Since the expression processor can be called by a function symbol’s own processor, in order to pre-evaluate optional parameters’ values, a circular reference safety check must exist for functions as well.

The constant symbol and parameter symbol processors simply check whether the result value was a constant and give an error if it isn’t. In addition, before the pre-evaluated value is assigned to a constant symbol or parameter symbol, it’s converted to the destination data type. I had to write my own optimized routines for this because the .NET framework is clearly a bit too slow with anything<-->string. And I learned how much pain in the ass it can be to convert a double value to string (just have a look at dtoa.c to get the idea) – I ended up implementing a much simpler algorithm for that conversion.

Even though constant value expressions are fully pre-calculated, I plan to add an expression simplifier for non-constant expressions as well in the future. It would basically turn expressions like a+b+2*20-c into a+b-c+40. However, this is a very difficult topic that will need much research in terms of grouping and ordering analysis, and I simply don’t want to be held back because of it (even though I have a partial implementation of it in the V3 compiler already). I don’t see it that important at the moment (in order to get this thing out someday I’ll leave it for now).

Name resolution
The expression is processed once token by token. Each encountered identifier is resolved by its name. This process is called name resolution. I have three special resolvers: Type resolver, Symbol resolver, and Overload resolver. The type resolver is called by a symbol processor (each symbol is validated by its own processor). A symbol processor exists for all symbol types. Most symbol processors resolve the associated data type, but some do additional work such as constant symbol processor who calls the expression processor in order to cache the pre-calculated value. The type resolver is the simplest one: since all type symbols must be declared in the root scope we can direct the search to that to begin with, and also only look for symbols classified as Type.

The identifier resolver is a tad more complex. It is told the context symbol and/or scope and whether the search should be locked to that context only or should it also be extended to upper levels if the name is not found immediately (one example of a locked context would be the dot notation path “a.b.c”). Unlocked search is recursive. In CoolBasic Classic, the main program exists in root scope, but it is not the root symbol; the Global symbol is the ultimate root node of the symbol tree, and all imported framework/engine functions such as LoadImage or MoveObject exist there. This means that you can override them by defining your own functions with the same name. The reason we tell the resolver both the context symbol and scope is that, for example, functions’ parameters don’t have scope during optional expression evaluation. In order to access constant symbols defined in the main program’s context, the search needs to get one symbol level up instead of scope level. Furthermore, functions’ local scopes are isolated from the main program (for obvious reasons), and local identifier, a variable or constant, name resolution cannot therefore access the main program’s base scope’s local variables and constants.

The overload resolution is an interesting one. First of all, name resolution has already succeeded on a function group symbol. Yes, we now have an extra container for all functions of the same name. It stores all overloads of that function. All we need to do is to pick the right overload and map the identifier to its symbol. So we pass a snapshot of the calculation stack to the overload resolver, and based on the data types of those values, the most appropriate overload is chosen. This means that you could have two functions of the same name, say Abs (like ‘absolute value’) that takes and returns an integer, and one that takes and returns a float. The compiler would then pick the right one based on the context in the expression, avoiding unnecessary casting and loss of precision due to choosing the wrong one. Should there be more than one equally qualified overload candidates, an error is generated. Should there be no qualified overloads at all, an error is also generated.

Completing expressions
Now what’s really new here is the injection system. It allows the addition of tokens in the middle of the token list without causing “memory move” operations like List.Insert() does (and yet we’re not operating with a linked list here – it’s not efficient enough in .NET, and I want to avoid the memory overhead generated by it). Omitted arguments for those function parameters classified as “optional” are a good example of injected tokens. But the feature also has one other very important use, for we’re also injecting type conversion tokens. This works with intrinsic data types (integer, float and string). Whenever a value of the wrong type is provided, CoolBasic Classic will try to convert the value to the correct destination type. For example, if you provide array indices as floats, they’ll be automatically converted to integers. Similarly, the function overload resolver tells back which values need to be converted and to which intrinsic data type.

We’ll probably be adding explicit type conversion operators to the language at some point as well; we’re just not sure about their naming yet. Just don’t be disappointed if you don’t see them in the first alpha or beta.

Another cool feature I’ve mentioned before is the presence of short-circuit And and Or operators. They are now fully implemented. They’re different from other operators in that while pre-calculation occurs in the same way as processing any binary operator in a postfix calculation stack, they actually produce byte code with conditional jumps instead of a single operator instruction. The hardest part was to infer the offset of the jump because type conversion instructions as well as loading omitted optional parameters to the instructions list occurs all the time (so that the offset cannot be determined during the conversion from infix to postfix). However, I came up with a clever mechanic that allows the reliable offset calculation without having to give up the idea of a single processing pass.

Byte code generation
I mentioned at the start of this blog post that I’m already able to produce byte code that would calculate the expression’s value in real Cool VES environment. A lot of different small parts had to be implemented before this goal was reached, but I think it’s now working quite well. The expression processor, in addition to calculating the result value, generates the full instruction set, including conversion instructions, short-circuit magic, and injected parameters.

One particularly tricky part was to implement dot notation path processing. While simple paths such as “a.b.c” are quite easy to pull off (just lock the name resolution context to the data type of the previous member’s value), it gets a little more complicated when assignment and arrays come in to play. I hate “exceptions to processing rules” so I had to come up with a unified model that just supports normal values, dot fields, dot array fields, and normal arrays. Array variables are always pointers to their actual buffers, but the value indicated by the index (or indices) on top of stack must be read by another instruction. And since context must be locked for the name resolution to succeed, unlike functions, the array field loader instruction cannot occur after argument values (and you cannot access elements not on top of stack which I could’ve done by creating my own custom stack structure, but it’d still be fundamentally wrong thing to do). So for an array access you need two instructions; one before and one after the arguments. Just like how C# and VB.NET compilers do it. It works now. Actually, byte code generated by CoolBasic Classic compiler is VERY similar to Microsoft CIL (Common Intermediate Language) generated by .NET compilers. For example, there’s no single instruction for operator “<=”, but it’s expressed with the combination of cgt, ldc.i4 0 and ceq instead – i.e. “not greater”.

Still few operators lack implementation of pre-evaluation support, but those should be painless to write in no time. One of these is the assignment operator (that again is a bit of a “different” case from the others), but perhaps I’ll get into that in the next blog post. Now this huge central piece is mainly done, the next big goal is to start iterating actual statements like If, While, For etc. Pretty much branching in general.

TL;DR
Expressions are now processed, and refined into real byte code that can execute on Cool VES platform. The next phase would be to process all statements, and with the help of the expression processor, create the final byte code output we can execute!

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress