Name Resolution

So I thought that I’d get an easy start to Pass#2 by first developing the constant value pre-calculator. It seemed like a logical thing to do for starters because those pre-evaluated values will be needed in the actual compilation of executing code. So I started writing a procedure that iterates all constants from the symbol table, and then calculates their values. I had all these circular reference algorithms planned out and so on. But of course surprises do happen (actually I should have seen this coming from a mile away), and I figured that before I can implement this fancy stuff, there are a few other things to take care of.

Wait, isn’t that the same precise thing I experienced in the beginning of Pass#1… oh, umm.. Yes it is! Back in Pass#1, I was hoping to get an easy start with simple statements such as Repeat…Until. But there’s no such thing as free lunch; I had to write the containing elements first (d’oh, of course!). Those being classes, functions and so on. On top of that, those statements just happened to be the most complicated ones, so plenty of general-purpose parsers needed to be written, too. So Pass#1 was quite frontloaded instead of backloaded a burden to chew on. As I’ve written in previous blog entries, the ending of Pass#1 went relatively quickly, and was just a copy-paste fest for the most part.

So what’s the frontload here in Pass#2? Well, even though you now got your pretty symbol table and the source code on a silver platter, you will still need to solve the identifier name references correctly because constant expressions can have constant identifiers in them. These references couldn’t be made in Pass#1 because identifiers can be declared in random order. Sooo… now you find yourself together with Mr. Name Resolution. Oh, and also, please meet his brother Mr. Overload Resolver. Oh, and don’t forget their sister, Ms. Template Expander. Whoah, so I suddenly have these 3, mean looking procedures on my face – right from the start. I’m not kidding, these are probably THE 3 most difficult entities for Pass#2. And again, they must be almost completely implemented before proceeding onto further tasks. Okey, the Template Expander can wait a little bit. I can probably get this pre-calculator to work correctly on *simple code* with just the Name Resolution algorithm implemented, but the 2 remaining villains need to be taken care of soon after.

Name Resolution is essentially the process of matching source code identifier names to their actual declaration. An identifier can be located in some of the upper levels or in a completely different branch within the program hierarchy. When the identifier reference is being resolved, also the access modifiers are taken into account. All this follows a specific rule-set of the OOP member accessing. Basically, an entity always has access to upper levels, but requires Public access for branches. Inherited class-based reference to base classes can also have Protected access. Combine these to overloading, shadowing (hiding by name) and procedure overriding (hiding by name and signature), and you got yourself a nice little soup. But these are my problems anyways 🙂

I have formulated algorithms for both constant expression calculation and name resolution, and I expect to carry them out during the up-coming weeks. The pre-calculator will be a big chunk of code. At this point in development, things get really interesting (and challenging).

Pass#2 is looming just behind the corner

All those 64 parsers of Pass#1 have now been written and tested. And that’s a lot of code! Virtually every statement has now had its parser implemented, even though some of those statements will probably be disabled for the first few alpha releases. I’m kind of relieved as I’ve now reached a “check point” in the development process. Yet there’s a lot more to come. All in all, I think it’s safe to say that I’m closing in the halfway of the entire compiler now that Pass#1 is ready. The compiler made it through the baptism of fire, by properly parsing a few-hundred lines long complete test class source code.

This means that given the source code of a random test program, the compiler has now performed syntax checks for all elements of it, and that there’s now a complete symbol table for Pass#2 to play with. All metadata has now been gathered… Pass#2 has all the info it needs in order to complete the transformation into the Intermediate Language. If Pass#1 was essentially the Parser, then you can consider Pass#2 the actual Compiler; Its main job is to process the executable code within procedures – and to order it in a meaningful way. Pass#2 is a very, VERY complex process, and it has a lot more tasks than Pass#1 had. I have already assembled a list of those tasks, but I’m not going to share it just yet. I’d rather divide them into smaller topics and discuss them separately in future blog entries.

There are few preceding steps before Pass#2 can start crunching the procedure code. One of them is Interface merging and the other is the pre-evaluation of constant value expressions. The latter is probably more challenging with all this circular reference thing going on. Other “difficult” parts of Pass#2 include Name Resolution, Overload Resolver and the actual Expression Transformer. More about those later. All I wanted to say, was that there’s much work to be done, and it’s getting more difficult now. Bleh, I’m probably going to be spending even more time just thinking things through before implementation.

ValueExpressionParser()

Pass#1 consists of dozens of parsers. All the “container” statements are already implemented and those missing are the “regular” statements i.e statements that can only be written inside a procedure which carry the actual program flow. I had no need to parse value expressions until now as I got started with statements such as If, Until, While, etc. Most of these standard statements include parts where a value is expected. And value is evaluated from an expression. Expressions can contain operators and operands and together they form a mathematical formula. A result of an expression can be a single value of any data type defined within a CoolBasic program.

The constant value expression parser was made long ago when I was writing the Function statement parser. It converts the expression to postfix notation and then attempts to evaluate it. It was perfectly sufficient for the job, so I decided to try out the same technique for the value expression parser aswell. Value expressions allow practically all expression elements whereas the constant parser implementation was merely partial due to the amount of “illegal” elements for a constant expression. The value expression parsing algorithm grew soon in size and wasn’t exactly the most enjoyable to debug with. Also, the more complicated expression elements such as jagged arrays caused all kinds of nasty problems that would have required ugly exceptions to be made for the algorihm. In order to debug the algorihm perfectly I needed long and complex test expressions. And since you’d need full debug info of what’s currently in the output queue, operator stack, and what’s the current state of the expression, it quickly lead to nice little flood in my Compiler debug window.

So I discarded the entire parser code, and started to think of new ways to check validity of a value expression. After sprawling on the couch (once again) for a couple of hours, I came up with a solution. I’d continue doing things just like I had been doing the entire time for Pass#1. Even expressions can be broken into smaller peaces, and each piece can be validated separately. One error, and the entire expression fails. Sounds like yet another great use of recursion, uh?

If you read compiler literature you’d notice that there are mainly two kinds of commonly used algorihms (not including their sub versions): The LL and LR parsers. They both scan expression terminals from left to right, and try to satisfy the requirements of an expression. The LL-parsers do this by forming the entire expression as a product of terminals and operators it comes across with, whereas the LR-parsers try to simplify the expression down to a single product. These algorithms are also known as “Top-down” and “Bottom-up” parsers. Since basically recursion is like a stack, the easiest way for me, was to use the Bottom-up parsing technique. It’s not by the book, but the basics are the same: Every time a value is expected, new recursive call is initiated. I’m not going to delve into details, but it appears to be working like a charm. Later on I also replaced the constant value expression parser.

So what this means, is that unless I hit unexpected problems, Pass#1 is finally coming toward its end. I still have lots of parsers to implement, but most of them are very simple. I should now have almost all bits of a statement, implemented as a parser. It’s just a puzzle left I need to put together with ready-to-use pieces.

Assigning stuff

C-languages (and the like) make difference between the “assign” operator and the “comparison: equal” operator, as opposed to BASIC-languages where both share the same symbol. The other assign operators such as “+=“, “-=“, “*=” and “/=“, in comparison, operate the same way in both languages – almost. In C-languages you can use assign operators pretty much anywhere within expressions because unlike in BASIC-languages, they return the value that was just assigned. That’s why you can use “a=b=c=1” in C, and have all the three variables be assigned a value of 1. In BASIC-languages, where symbol “=” defines the comparison operator, the same statement would equal C-expression of “a=(b==c==1)” which would result in a boolean value of True or False to be assigned to variable a, depending on the values of b and c.

The problem in BASIC-langages is that the compiler can’t tell whether the user wanted to express an assignment or a comparison with symbol “=“, except for the first one in line; the rest are considered “comparison: equal”. BASIC-language parsers typically distinguish these two operators (that share the same symbol) by context: Are we expecting a value here or not. Therefore, by design, BASIC-language assign operators don’t return a value. Unfortunately, nor do any of the other assign operators: “+=“, “-=“, etc. because of consistency. Thus you can’t do loop auto increments like:

While ((myVar += 1) < 10)

In addition, since no value is evaluated by the assignment you can’t have the operation applied before/after evaluation, like the “++” and “--” operators in C do (they can appear before or after a variable). In addition, the “--” as prefix has also a double meaning: it can either be decrement before evaluation, or a double unary minus. Too much noice here. And that’s why those operators are missing from most BASIC-languages including VB.NET – and they won’t make it to CoolBasic either. You will be able to use “i += 1“, but not within a value expression.

In summary, these are the assign operators that will be implemented to CoolBasic V3:
"=", "+=", "-=", "*=", "/=", "<<=" and ">>=".

Edit 15.5.2009: Fixed typos.

Annoying syntactical exceptions

So I have been writing these Pass#1 statement parsers for a while now (a bit more than 20 more to go). Recently, I came across with the Declare Function statement. It’s responsible for declaring a function from an outside resource – for example, a DLL-file. The general guideline is that all statements would look the same as the corresponding statement in VB.NET. So I began the normal procedure I do with every new statement parser, i.e type in a sample code in a source file which will then be passed to the compiler. Now, the first thing to do is to enable the parser function in question and then create the parser skeleton for quick OK check. I was surprised to notice that the compilation failed due to syntax error before the code line got even passed to the main parser function. The VB.NET syntax is as follows:

Declare Function MyFunc Lib "System.DLL" Alias "_RealFunctionName" ( [params] ) As [type]

Now the general syntax rules define that a literal cannot appear just before an opening parenthesis. And well well, isn’t that a string literal and a parameter list just after. I was first like, WTF how can I “fix” this in a non-ugly way without breaking the general syntax rule set. I had three choises, none of which seemed good way to go:

  1. Allow a string literal to precede an opening parenthesis
  2. Allow a string literal to precede an opening parenthesis only in a Declare statement
  3. Alter the syntax of the Declare statement

Definately not Option 1. Option 2 seemed to be the best choise because I still did’t want to alter the syntax of the statement. But that would have been an exception to the syntax rules. I don’t like exceptions in any program structure, because in the end, it will make code maintenance messed up. I thought about pros and cons of options 2 & 3 for a long time. I could have added a small keyword between the string literal and parameter list. The resulting statement syntax could look something like this:

Declare Function MyFunc Lib "System.DLL" Alias "_RealFunctionName" Ansi ( [params] ) As [type]

…where “Ansi” could be, for example, a charset modifier (Ansi/Unicode/Auto). On the other hand, charset modifier is already an optional modifier in the VB.NET syntax; its place is right after the Declare keyword. CoolBasic won’t provide support for charset modifiers just yet (since it’s optional I can add it later on without breaking all existing CoolBasic source codes which would make users angry).

Another solution for Option 3 would be to define Alias as identifier instead of a string. The syntax would look something like this:

Declare Function MyFunc Lib "System.DLL" Alias _RealFunctionName ( [params] ) As [type]

It seemed like a good idea at first, but come to think about it the solution would be logically wrong. By definition, all identifiers should be usable by the programmer. In this case, the Alias identifier would not – or it would be unintended identifier reserving a name without real operational use. At this point I ran out of options. This isn’t by the best practise, but I decided to go for Option 2. Luckily I’ve designed the lexer to be quite flexible so I only needed a small change in one place, basically overriding a syntax error based on a simple condition of few preceding line tokens.

As a result, the syntax will be exactly like in VB.NET, as intended. Internally, the fix didn’t end up being as ugly as I had feared. Overall, I’m content with the outcome. Pass#1 continues…

Parsing in Segments

Pass#1 is responsible for so many things it needs to be designed very carefully. I have come up with a plan about the remaining compiling phases: There will be 3 or 4 passes in total of which Pass#1 is the parser. Its main job is to compile the symbol table for the source code, and generate variances of the generic classes and functions. In short, it will prepare the source code for Pass#2 where the actual compilation of each procedure takes place. Of course, this involves making the code as clean as possible by removing modifiers and certain other structural “particles”.

Parsing in general can be quite difficult a task, but this time I’ve taken entirely different approach. As we begin Pass#1 (for each line separately) some basic syntax checks have already been done. Therefore we can safely assume that the line always begins with:

  • Pointer literal
  • Identifier
  • Label
  • Full-line special data segment
  • Keyword

Each of which launches a mini-parser of their own. These include ParseKeywords(), ParseStatementValue(), ParseAssignment(), ParseParamList(), ParseDeclaration() and ParseModifiers(). They scan the terminals of the current line by segments, i.e group terminals based on which class they belong to. All this is done recursively so in case of a failure, the entire call stack rolls back, ultimately terminating the entire compiling process. The pros of the recursive parsing is that it allows a flexible identification of expressions which belong to a multi-part statement such as the For-statement.

In addition, all in-built statements have their own parsers that make sure the keyword has been used within a proper declaration context, and that those statements that define a code block, follow a valid hierarchical structure. As many syntax checks will be made as possible, but the final compilation in Pass#2 will detect the rest of them.

CoolBasic V3 follows very sophisticated and strict syntactical rules similar to VB.NET. This means that in order to parse a simple Break-statement I must first establish the base for parsing modifiers, Modules, Functions, and Repeat…Until. In another words, I must first set up alot of functions just to get things running by putting it all together for the minimum. Lots of code is required before I can even run the basic tests (for example scope list and the symbol table). Pass#1 is coming along nicely, though.

Compilation: Pass#1

Since I decided to add cross-reference support for CoolBasic (see blog entry “Highly Modular”), some major changes to the entire process were needed. Therefore, I’ve rewritten the lexer in preparation for Pass#1. Due to the nature of these changes, the way inner lists work also needed a review. I’m quite happy with the results. Lexing algorithms improved, for example the assignment operator is now better distinguished from the comparison operator. The assignment is now automatically cast for statements like Optional in function declaration, variable declaration, or the For…Next assignment. This will ease further parsing in Pass#1 and Pass#2.

Pass#1 compiles the symbol table i.e list of all identifiers. Basically, the compiler will scan for declaration statements and then create a collection of their names. Some secondary information such as path, relationship, block entry/end and signature are also formulated. Pass#1 can also process the hierarchy integrity as well as perform some initial checks to the syntax. We can’t create exact pointers to parent entities yet because program structures can be declared in random order within the source. Other tasks include generation of enums and trimming the code from entity modifiers. In general, Pass#1 is just a fast sweep and it doesn’t analyse the syntax in great depth. Deep scan belongs to Pass#2 and perhaps Pass#3.

Constants are interesting because they can be used to define each others, for example Const a = 1 : Const b = a + 3. It’s, however, possible to create a circular reference which renders the evaluation of the constant’s value, impossible. You can’t make exceptions to the general reference resolver (requiring Constants to be always declared before using) either. Being unable to specify constants as part of another constant expression, is stupid, too. This means I have to implement a recursive search at later passes when constants’ true values are pre-calulated. Pass#1 doesn’t yet do anything about the constants’ values anyway. In VB.NET constants aren’t really replacing values within the source; they’re constant variables i.e Static ReadOnly. They are, however, pre-evaluated at compiling time to determine constant expressions and thus discarding false condition code blocks. This is where CoolBasic will differ – constants will be transformed into literal values which means speed increase at runtime.

More about Pass#2 when I’m done with this…

Highly Modular

I spent most of yesterday sprawling on the couch thinking about the fundamentals of programming languages and how, for example, VB.NET has most of its modern features implemented. I came into conclusion that it’s absolutely necessary to discard the idea of one-pass top-down parsing when the symbol table is created. Most of the old-fashion procedural languages (which includes the majority of BASICs), scan the source in two or three passes, but always so that when a new identifier is encountered it must have already been defined; otherwise the compiler would report an “unknown identifier”, and the process terminates.

Since all code in OOP-languages must be written within a container (Module, Class, or an Interface), there’s a problem when you must declare identifiers with each other’s type. There are no “prototypes” to introduce identifiers before actual declaration. The following code illustrates the problem:

Class Class1
Public c2 As Class2
EndClass

Class Class2
Public c1 As Class1
EndClass

The old-fashioned compiler would stop at the 2nd line of Class1 because the abstract data type Class2 is not yet declared anywhere within the preceding code lines. Therefore we need two passes at the minimum when we compile the the symbolic table of the entire source code. Practically speaking, CoolBasic will still compile the symbolic table during the first pass, but some information of each identifier such as its data type doesn’t yet point to the structure of an identified data type. Instead, there’s merely a string-based reference, which will be taken care of in later compile parses.

Structure of a CoolBasic program
The ROOT level i.e source file context can only contain Modules, Classes, Interfaces and the Imports statement. The Module declaration context, can contain Classes, Structures, Procedures, Enums, and variables. Classes can contain Classes, Structures, Procedures, Properties, Operators, Enums, and variables. Structures can contain Structures, Procedures, Properties, Operators, Enums, and variables. Procedures and Blocks can only contain executable code and variable definitions. Those variables are considered Public, but local. A block can’t define a variable that shadows parent level local identifier. The “variable” here also refers to any Constant. Identifier match resolver operates in standard dot notation rules, but each Imports declares a special namespace. If an identifier reference couldn’t be resolved, the compiler will attempt to match it with these namespaces as a prefix of the identifier.

Minimum language feature requirements
Object Oriented Programming unloads some very powerful tools of which I can take advantage of in creating more sophisticated structures. As I was lying on the couch this huge idea came to my mind… could this be possible?! What if you actually wrote some of the very basic features of any programming language by hand… in CoolBasic itself. Things such as strings, arrays or linked lists. So I spent some time mapping what are the bare minimum requirements the language absolutely needs to support, in order to provide everything essential so one could write all other features that are considered “standard” in today’s programming languages. These are my findings:

The language must support at least:

Basic program flow, Modules, Classes (with full Inheritance support), Procedures, Declares, Structures, Variables, Constants, Properties, Operators, Generics, certain in-built Operators, The ROOT data type

Can be implemented later:

Imports, Interfaces, ForEach

Need to be implemented before starting to write the essential classes in CoolBasic itself:

Memory, Console

Can be written in CoolBasic itself:

Integers, Floats, Strings, Arrays, System services

Sophisticated features written in CoolBasic itself:

List, LinkedList, Stack, Queue

Game related:

Graphics System, Object System, Input System, Sound System, etc.

“Imported Packages”
I have this idea of “Open Source” where all these modules written in CoolBasic itself, come with the package. You can see how they’re done code-wise, and you can even modify or optimize them. These modules would be located in an “Imports”-folder or something like that. The compiler would automatically import the standard modules, but nothing will stop you from adding in your own modules and by using the Imports statement, include them into the compilation. There’s one down-side to that, though: It means that *everything* will be compiled *every time*, slowing the process a bit. One way of solving this problem, would be a support for pre-compiled libraries (CoolBasic-made of course), but I need to do some more research about that.

Implementing Strings

Every time I have started refining CoolBasic there’s a number of issues that keep haunting me about how to implement them in the “proper” way. One of these topics is handling strings. Strings are not values because they have dynamic lenght and memory consumption and thus can’t be stored in the stack where all integers and floats reside. Strings work as references instead, meaning that the string variable is really an integer pointing to the actual position in memory which holds the actual string data. Every time a string is modified in some way (when its length changes), a new memory block will be allocated for new string data, and the old string will be freed. Basically, the string pointer changes in order to keep the reference up to date.

Because of this behaviour it’s problematic to free the string with rest of the values when, say, the containing procedure, block or class instance ceases to exist. You’d just lose the pointer, but the string itself remains in memory with no way to access it again! You could require the programmer to manually destroy all reference objects when needed, but it’d be kind of stupid to make strings part of that since they are intrinsic data types (i.e those literal values that come in-built with CoolBasic). That’s why you don’t want to force the use of the “New” keyword when declaring string values and variables. We’re after VB.NETish syntax, after all.

I was first thinking of implementing strings as a true CoolBasic-written class. However, it would mean that the full class source was included at the beginning of every CoolBasic program, increasing line count and thus processing time. It would also require the standard New-assignment for each string. But that was not the main problem since it could easily be overridden by pre-compiler that transforms the syntax during the compilation. In the other hand, had CoolBasic an automatic Garbage Collection system the strings would get deleted automatically with the rest of unreferenced objects. However, CoolBasic does not have such mechanics (the programmer needs to take care of freeing objects by calling their destructors).

The following code will illustrate the loss-of-pointer problem. Consider:

Dim a As String = "A"
Dim b As String = "B"
a = b.Trim()

First, pointer of string “B” will be evaluated and pushed into the stack, followed by function call Trim(). New string with leading and trailing white-spaces removed will be created and pushed into the stack, replacing the original pointer. Now, variable “a” contains pointer to existing string “A” in memory. But the pointer will be replaced in the assignment, and since strings are always unique i.e two variables cannot have reference to the same string in memory, the pointer to string “A” will be lost for good! It will remain in memory with no way to access nor free it in the program. This is called memory leak, and is considered bad programming. It can lead to memory starvation.

Luckily, there are only three things that need to be taken into account to prevent this. Firstly, every time a string literal occurrs, it will be copied and then its (new) pointer gets pushed into the stack. The template never changes, and we now have two identical strings at different memory locations. Secondly, every time there is an assignment to a string type variable, a .Finalize -property will be added by the compiler. And finally, we will take advantage of one of the new features of CoolBasic V3, class properties! Every time a scope (be it a procedure, class destructor or just a code block) ends, there will be automatically added .Finalize=Nothing -calls for all string variables, including arrays. Finalize is a WriteOnly property of the intrinsic String class which acts as a delegate function. This means that we can inject some program code before the actual value assignment happens. Basically, we’ll just deallocate any existing string at the current pointer and then assign a new pointer.

The compiler will transform string assignments like this:

a.Finalize = b.Trim()

Of course this behaviour differs from normal instance variable handling since there can be several references to same object. So far, this is by far the most sophisticated and proper way of handling strings any CoolBasic generation has ever implemented, and I’m very happy about it right now. For the moment, strings are not yet implemented, but at least the concept is now thought through.

New identifier definitions

Today I effectively scrapped the entire AnalyseSyntax()-function in the compiler and wrote it all over again from the scratch. Then I realised that the rewrite wasn’t much better then the original so I scrapped that too. And I did it again. I ended up with version 4 of that procedure. The reason? Gathering the identifiers can be done in many ways. There are good ways and not so good ways. The essential difference of these algorithms derives from only one fact: which ever ends up being the most dynamical one, tends to make future updates easier.

So what is the key that all new identifiers have in common in BASIC languages? It’s the keyword “As”. My original system scanned the modifiers and flag values within statements with no strict relations between them. This resulted in lots of extra conditions and made my code a bit tangly. The next version tried to build statements keyword by keyword eventually forming the statement with all attributes included. Since keywords have dependencies on each other and strictly pre-defined legal order in which they can appear, it was made impossible to make them “conditional” especially when they appear before the actual keyword that identifies the statement.

Then there was an option to use regexp templates on structural representation of the statement. However, this can lead to problems with non-constant syntaxes like function or operator declarations, or array definitions; typically everything that can have comma-separated parameters.

Finally, the best solution (so far) was to scan the entire statement token by token, just like I did in the first place, but this time by defining rules how keywords can be linked. Basically, each modifier and flag value has list of those keywords that can lead to them – including line start. This will ensure that no keyword can be out of place nor out of context. In addition, it should be easy to add new keywords in the future.

So what is now done? It’s still the constants. But I hope I’ve created a base on which I can start building new things soon. Oh, and I did something else, too: You can now define new instances just like in Visual Basic, “Dim i As New k” is equivalent to “Dim i As k = New k

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress