Highly Modular

I spent most of yesterday sprawling on the couch thinking about the fundamentals of programming languages and how, for example, VB.NET has most of its modern features implemented. I came into conclusion that it’s absolutely necessary to discard the idea of one-pass top-down parsing when the symbol table is created. Most of the old-fashion procedural languages (which includes the majority of BASICs), scan the source in two or three passes, but always so that when a new identifier is encountered it must have already been defined; otherwise the compiler would report an “unknown identifier”, and the process terminates.

Since all code in OOP-languages must be written within a container (Module, Class, or an Interface), there’s a problem when you must declare identifiers with each other’s type. There are no “prototypes” to introduce identifiers before actual declaration. The following code illustrates the problem:

Class Class1
Public c2 As Class2
EndClass

Class Class2
Public c1 As Class1
EndClass

The old-fashioned compiler would stop at the 2nd line of Class1 because the abstract data type Class2 is not yet declared anywhere within the preceding code lines. Therefore we need two passes at the minimum when we compile the the symbolic table of the entire source code. Practically speaking, CoolBasic will still compile the symbolic table during the first pass, but some information of each identifier such as its data type doesn’t yet point to the structure of an identified data type. Instead, there’s merely a string-based reference, which will be taken care of in later compile parses.

Structure of a CoolBasic program
The ROOT level i.e source file context can only contain Modules, Classes, Interfaces and the Imports statement. The Module declaration context, can contain Classes, Structures, Procedures, Enums, and variables. Classes can contain Classes, Structures, Procedures, Properties, Operators, Enums, and variables. Structures can contain Structures, Procedures, Properties, Operators, Enums, and variables. Procedures and Blocks can only contain executable code and variable definitions. Those variables are considered Public, but local. A block can’t define a variable that shadows parent level local identifier. The “variable” here also refers to any Constant. Identifier match resolver operates in standard dot notation rules, but each Imports declares a special namespace. If an identifier reference couldn’t be resolved, the compiler will attempt to match it with these namespaces as a prefix of the identifier.

Minimum language feature requirements
Object Oriented Programming unloads some very powerful tools of which I can take advantage of in creating more sophisticated structures. As I was lying on the couch this huge idea came to my mind… could this be possible?! What if you actually wrote some of the very basic features of any programming language by hand… in CoolBasic itself. Things such as strings, arrays or linked lists. So I spent some time mapping what are the bare minimum requirements the language absolutely needs to support, in order to provide everything essential so one could write all other features that are considered “standard” in today’s programming languages. These are my findings:

The language must support at least:

Basic program flow, Modules, Classes (with full Inheritance support), Procedures, Declares, Structures, Variables, Constants, Properties, Operators, Generics, certain in-built Operators, The ROOT data type

Can be implemented later:

Imports, Interfaces, ForEach

Need to be implemented before starting to write the essential classes in CoolBasic itself:

Memory, Console

Can be written in CoolBasic itself:

Integers, Floats, Strings, Arrays, System services

Sophisticated features written in CoolBasic itself:

List, LinkedList, Stack, Queue

Game related:

Graphics System, Object System, Input System, Sound System, etc.

“Imported Packages”
I have this idea of “Open Source” where all these modules written in CoolBasic itself, come with the package. You can see how they’re done code-wise, and you can even modify or optimize them. These modules would be located in an “Imports”-folder or something like that. The compiler would automatically import the standard modules, but nothing will stop you from adding in your own modules and by using the Imports statement, include them into the compilation. There’s one down-side to that, though: It means that *everything* will be compiled *every time*, slowing the process a bit. One way of solving this problem, would be a support for pre-compiled libraries (CoolBasic-made of course), but I need to do some more research about that.

Properties

As a new feature, CoolBasic V3 will support class properties. You may have already seen a hint for property support when I discussed about the String implementation earlier in this blog. Properties work just like any other Public variable, i.e you can access then by using the dot notation, but with a added twist. Normally, when you either assign or read a value from a Public variable nothing else will happen. But with properties you are able to capture this event, and execute your custom code to get the job done. Basically, we’re talking about old-fashioned way of creating getter and setter -functions. Without properties you’d have to write these two functions in order to access, say Private variables:

MyClass.SetName("hello")
Console.WriteLine(MyClass.GetName())

With properties you could do the same in more decent way:

MyClass.Name = "hello"
Console.WriteLine(MyClass.Name)

You still define the getting and setting code within the property structure, but done so they’re unaccessible from outside. So you have nice little black box of which workings not even the class members know about. There’s a difference in using the property name than assigning the value to the private variable in question by hand – even when all the setter does is the assign. There’s always code involved when accessing property values.

In addition you can define the property as ReadOnly or WriteOnly. By doing this, you protect some data that must not be modifier from outside. One example of a ReadOnly property would be the length of a string. The property returns the value of a Private variable, however this information is related to the actual data within the String class, and should only be updated when this inner data changes. It’s the only way of keeping such information in sync in a secure way. Now if you come to think about it, constants are ReadOnly, too (except that CoolBasic compiler doesn’t work like .NET compilers in this regard).

Since properties are really a set of methods, they can be inherited. This also means that things like overriding, are possible. However, at this point in time, I’m still a bit unsure about one thing called Default property. A property can be defined as Default when you access the object without providing further dot notation. Sound like normal object reference? Well, there’s a catch. In VB.NET Default properties always need atleast one parameter which means you must include a set of brackets to it. A typical example would be the myVar.Item(index) property you could refer to just myVar(index). I don’t like this because it’s easy to mix this to a method call.

On the other hand, I would still have to allow parameters for properties because it’s the only way to implement indexes to them. Like above. If I ever wanted to implement LinkedLists or Lists or any variant, I, in fact, must allow this. There are still lots of questions floating I need to chew on.

Scope rules in CoolBasic V3

Warning, wall of text incoming!

“In computer programming, scope is an enclosing context where values and expressions are associated. The type of scope determines what kind of entities it can contain and how it affects them. Typically, scope is used to define the visibility and reach of information hiding.”

The scope of an entity’s name is the set of all declaration spaces within which it’s possible to refer to that name without qualification. In general, the scope of an entity’s name is its entire declaration context; however, an entity’s declaration may contain nested declarations of entities with the same name. In that case, the nested entity shadows, or hides, the outer entity, and access to the shadowed entity is only possible through qualification.

So much for theory (the quoted text was from Wikipedia). Now, what does that mean for CoolBasic…

In VB.NET and C# there are mainly two kinds of scopes: We have a root/namespace/class/structure context, and then we have a method/block context. The main difference between the two is how data can be referenced in Object Oriented Programming. Class instance variables (objects) store their values in heap because they are reference types and their life span is not determined beforehand by the compiler. Procedure variables, however, store their values in stack. It’s a static space in memory where procedure variables temportarily reside, but from which the space is automatically freed when the life span of all those variables ends, i.e the procedure exits. Because of the stack, recursive calls are possible.

It’s important to realize that the program structure is highly hierarchical in Object Oriented Programming. It’s all about references to other objects and their parent-child relationship. Gone are the days of procedural programming and direct function calls where the only way to share data between two functions, was to introduce a global variable to hold that data. Now, global variables are quite bad a concept and frowned upon by most programmers today. That’s because in general, it’s often difficult to keep track of the value of a global variable plus all entities that can modify it. In the end, there’s more cons than pros when we’re dealing with large and complex projects.

I will not talk about global variables (or the lack thereof) just yet. Nor will I define concepts of Inheritance, Shadowing or Overriding because in all probability I’m going to do so at some point in future anyway. For quick reference, “inheritance” means reference-based access to a parent class from a derived class. “Shadowing” means identifier matching to the nearest level upwards within the program tree. “Overriding” means priority of methods, operators and properties through inheritance for identifiers with matching signature. “Implementing” means copying functinalities from an interface-template.

In VB.NET and C# there are some restrictions what you can and can’t declare within each type of entity. The basic concept is that all code is written inside a class. The only exception would be the root context which itself behaves like a class. However, root level should have the main object created as soon as possible so the program can be redirected to true class-based execution. Classes can have methods (functions), operators (kind-of functions, but executed as operator instead of value), properties (a public attribute with functional purpose) and even inner classes. In addition, there can be structures (similar to classes), enums, constants, and attributes (variables). I’ll describe some restrictions that were mentioned in the MSDN documentation, but were never explained or justified why you can’t do these things. I had to figure these out myself, and there might still be some caps behind my logic. However, trying to understand the problem by oneself tends to be quite an enriching and valuable experience.

A function cannot be declared within the Block or Method context
A Block context does not have an identifying name that you could use to access it by using the dot notation. This itself is logically wrong. In addition, the member access operator (dot) only works with references, and neither block nor function is a reference. This means that functions don’t get unique memory addresses that are visible to the programmer. Even if you had a pointer to the function, it’d be still pointing to the code entry point and not to the actual member variables. Function members can never be Public. They can’t be Protected either, because functions cannot inherit from other functions. So function members are Private which means you couldn’t access them from the outside anyways. Function variables and their memory addresses will be determined during runtime because they reside in Stack. This offset can’t be calculated at compiling time. References, in the other hand, point directly to member variable addresses so the reference is always direct and valid.

This, however, doesn’t explain why the compiler couldn’t just enforce Private functions within the Block/Method context (all identifiers are local in these context levels). Firstly, there’s the logical issue: Why on earth would anyone want to defined private functions inside other functions. It’s like restricting yourself from data you’re eligible to use anyways – for no gain. Secondly, there’s the technical issue. While I *could* allow this kind of declaration in CoolBasic (even though it’s forbidden in .NET-languages), I think that it would be better for the compiler’s sake to prevent from mixing two separate scope definitors. Private/Public/Protected are intended to define reference context accessibility, the inheritance thing. Whereas keyword Dim is meant to define identifiers within nested local scopes. There’s no equivalece to Public in Method/Block context (if there was, you’d be able to call conditional code from outside, which could potentially break the program flow integrity), as explained above.

So let’s just assume that we have a function inside a local scope (another function or a block) with no separate access modifier in the declaration. The declaration is now legal as long as there are no inheritance-related modifiers or flags. We call this inner function only from inside of the hosting scope. What’s wrong with this? Well… nothing really, besides the odd visibility and reach. Can we implement this? Probably. Should we allow this? I don’t know. It just happens that when you enter a method scope you can, in fact, declare an identically named inner function. In this case, which function does a recursive call direct to? In normal rules of shadowing, it would be the same scope of the calling code unless qualified with Me or MyBase. The problem is, these two variables always refer to a class so you can’t point to the inner function. Either you use Me.myFunc() for the outer function or just myFunc() for the inner function. Also, it matters whether you make the recursive call from inside the inner function or from outside of it. Add different function signatures and you have a nice soup of failure.

As what comes to declaring classes within functions, the same rules kind of apply. You can’t use function name as parth of the dot notation, to point to a class or nested function. Even though the issue would not be reference-based at runtime (procedures get their fixed location within ASM at compiling time). It’s about the syntax and whether or not it’s allowed to do inconsistent/logically wrong things. Function references within the source code should always contain a set of brackets – which itself satisfies the member access operator (dot) because such a token will be identified as value evaluation.

A structure or enum cannot be declared within the Block or Method context
First of all, being unable to declare structures within procedures, but still being able to do so outside (which the function could use), made no sence to be. There shouldn’t be any restrictions becasue you can declare variables normally. Only so that the new structure/enum would be visible locally within the method scope. Some rules that were explained above still apply, but I think I have come up with a better explanation. Picture this: You have a structure called “myStruct” within a class context. Then you have a method that takes a typed variable (myStruct) in as a parameter. Because the scope has now changed, you can declare another structure that has the same name within the method. When the source code gets parsed for the first time, all identifier definitions are scanned. While it wouldn’t normally matter where you have the structures defined and still being able to use them from anywhere within its scope, it just happens that variables are identifiers, too. So they get scanned in the same way. You *could* do multiple passes for the source code to scan in layers, but in OOP this can be a bit problematic and slow. So in this scenario, you can define a variable as “myStruct” before and after the inner Structure is declared. Which means we have two different types for those variables even though they were typed using the same name. Confusing, eh?

Conclusion
When you think about it, the answer to all these problems could be the simplest one: They don’t want you to write structural definitions (such as structures, enums, functions or classes) to a code block which has executable code part (and thus no no declaration part that is separate). The basic priciple is that everything belong to their little boxes, and only access modifiers, inheritance and shadowing rule out information hiding and reach because they work in both ways in the program hierarchy. I think there’s enough reason already to conclude this think-tank into following the rules the wise guys at Microsoft have agreed upon, and not to allow nested functions in CoolBasic V3.

New identifier definitions

Today I effectively scrapped the entire AnalyseSyntax()-function in the compiler and wrote it all over again from the scratch. Then I realised that the rewrite wasn’t much better then the original so I scrapped that too. And I did it again. I ended up with version 4 of that procedure. The reason? Gathering the identifiers can be done in many ways. There are good ways and not so good ways. The essential difference of these algorithms derives from only one fact: which ever ends up being the most dynamical one, tends to make future updates easier.

So what is the key that all new identifiers have in common in BASIC languages? It’s the keyword “As”. My original system scanned the modifiers and flag values within statements with no strict relations between them. This resulted in lots of extra conditions and made my code a bit tangly. The next version tried to build statements keyword by keyword eventually forming the statement with all attributes included. Since keywords have dependencies on each other and strictly pre-defined legal order in which they can appear, it was made impossible to make them “conditional” especially when they appear before the actual keyword that identifies the statement.

Then there was an option to use regexp templates on structural representation of the statement. However, this can lead to problems with non-constant syntaxes like function or operator declarations, or array definitions; typically everything that can have comma-separated parameters.

Finally, the best solution (so far) was to scan the entire statement token by token, just like I did in the first place, but this time by defining rules how keywords can be linked. Basically, each modifier and flag value has list of those keywords that can lead to them – including line start. This will ensure that no keyword can be out of place nor out of context. In addition, it should be easy to add new keywords in the future.

So what is now done? It’s still the constants. But I hope I’ve created a base on which I can start building new things soon. Oh, and I did something else, too: You can now define new instances just like in Visual Basic, “Dim i As New k” is equivalent to “Dim i As k = New k

Analysing the syntax

The compiler is currently at a stage where syntax of program basic structures need to be validated. This pass involves parsing and data gathering, from which the first refers to syntax cheking of classes/functions/constants etc, and the latter is basically forming the Token Table for the program. There are 9 data structures to do: constants, enumerations, program flow, jumps, structures, variable definition, functions, operators and classes.

There is no “properties” in this list. While this is something that would be exciting to implement, I’ve decided not to – just yet. Properties are part of classes that involve usually two functions: getters and setters for certain attributes. Traditionally programmers used to have these two separate functions, but it’s more elegant to have normal assignments and values in spite of they actually perform more code under the hood. You could use something like this:

myClassVariable.SetMyPropertyValue(myValue)
variable = myClassVariable.GetMyPropertyValue()

However, the following way is cleaner:

myClassVariable.myProperty = value
variable = myClassVariable.myProperty

Both of these would still perform a function within the class. I’ll keep this item on my TODO-list, but let’s try to establish the core first 🙂

Back to the analysing process. So those 9 keyword statements need to be parsed in order for CoolBasic to be able to compile a list of all token names. Naturally, each token must have additional data assigned to them such as access modifiers, inheritance modifiers, overriding modifiers and certain flag values. I don’t want to attach all these fields to every token data because not all of them is never needed at the same time and this would only munch unneeded memory for large programs. So it’s quite challenging task to create any “optimal” way to store this kind of data that could bend to all needs at once.

The programming tool I’m using to create CoolBasic V3 compiler doesn’t really support classes and inheritance properly (woot! why can’t you just use Visual Studio?! – It’s not fast enough ^^). I’m using allocated memory blocks to store only the most relevant data. This will make my code somewhat uglier than I had hoped, but at least it’s memory efficient. Note to self: Just remember to maintain the clean-up functions accordingly, mmkay…

And finally, some good news. The consants are already done 🙂

Comment parsing

A few things about source code comments worth mentioning. First of all, both singleline comments and multiline comments will still exist in V3. Comments will be written pretty much the same way they’re written in C (and the related) languages. That is, singleline comments beginning with double slashes “//”, and multiline comments introduced as block that starts with “/*” and ends with “*/”. No more remstart/remend.

For now, multiline comments are stripped already in the file reading pass, and not in the lexical phase (singleline comments are handled here, though). I thought it’d be more efficient in terms of compilation time if lines that wouldn’t be processed anyway, were ignored completely instead of going through the normal tokenization. Due to this, comment ending and starting can’t occurr within a logical line. Obviously this would be easily fixed, but it’s extra work and some code recycling – which is rarely a good thing. This means that the “/*” and “*/” must be on their respective lines alone in a similar manner to how remstart and remend currently work. It may require some time to get used to this, but as it stands now, my decision is justifiable.

Optimizing and pre-evaluating

Many present compilers try to simplify mathematical expressions during compile time in order to ease the burden of the final executable. Naturally this results in faster runtime code execution in some cases and even the size of the executable may shrink a bit. I’ve been working on some similar mechanics for CoolBasic V3 compiler, and at this point in time I’ve grown pretty satisfied with it. Most constant expressions will now be pre-calculated, meaning that statements get shorter before the actual syntax checks begin. All of this will help me to establish as bug-free development as possible. Compilation based on very very strict rules, in the end, tends to prevent exceptions in the algorithms. Simplicity is beautiful.

This kind of optimization is, however, very hard to carry out perfectly. An an example of this matter would be the following expression to optimize well:

1 + 2 + a [into] 3 + a

But grouping is a bit trickier:

1 + a + 2 [into] 3 + a

For now, I decided to leave as it is i.e. only the first example remains in effect.

Another funny part of this implementation of pre-evaluation is that the compiler’s source code almost doubled just by adding this feature. You can’t really say it wasn’t worth it, though. As a result, you gain:

  • Faster runtime execution
  • Faster compilation
  • Smaller executable
  • False conditions (like If/Until/While 0) can be stripped from the final executable

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress