The new parser is underway!

It’s been a bit silent ever since I finished the V3 compiler – oh well, kind of. Even thought the compiler was fully functional, I realized that it had to be re-written in order to gain more performance. Due to drastic syntax changes of the key elements I can’t copy-paste the previous code much at all. And, writing basically the same old code again is – big surprise – boring. I also finished my thesis so I felt complete, and wanted to take a few weeks off. I’ve been working hard one year straight after all.

So things have progressed a little slowly for the past few weeks, but at least they have. I didn’t want to blog until I had finished a certain feature, and that’s the support for conditional compiling. I demoed the feature in Assembly Summer 2009, but it has now been enhanced! Conditional compilation practically means that the programmer can affect at compile time which parts of the source code will or will not be compiled. Because CoolBasic V3 compiler optimizes all conditional branches, including If, While, and Until, by stripping constant expressions that evaluate to False, from the final Intermediate Language, the #If and If statements were essentially identical. However, conditional compilation now gets evaluated immediately after lexical analysis so that False branches don’t get processed in Pass#1 or Pass#2 at all!

These so-called directive expressions must evaluate to a constant value. Because the directive expressions will be evaluated before statement parsing, regular constant variables can’t be resolved. Thus, there will be no name resolution (as we understand it in OOP), but instead, CoolBasic introduces directive constants which can be declared via the #Const statement. These constants are separate from the regular ones you’d declare via the standard Const statement, and they will not conflict by name. Directive constants are always Private to the file they are declared in. In addition, it’s possible to define global directive constants at project level, but that will be a feature of the code editor.

In contrast to regular constants, directive constants can be declared multiple times with the same name, and it’s possible to assign a different value to them each time. You can use existing directive constants in directive expressions when declaring other directive constants. Imagine the possibilities for implementing different sound engines, for example. The following example will shed some light about the concept:

#Const DEBUG = True
#If DEBUG
    // This will get compiled, and is available in program
    Function A() 
    EndFunction
#Else
    // This is unavailable in program – unless #Const DEBUG = False
    Function B() 
    EndFunction
#EndIf
#Const DEBUG = Not DEBUG // Redefine the DEBUG directive constant

As directive expressions are processed so early, I had to write yet another postfix converter and calculator for the purpose. God that was boring. At least now it’s done. Next, I should be able to start writing statement specific parsers and general-purpose parsers.

I may (or may not) have a mega surprise next time. Stay tuned.

Introducing the new Lexer

Last time I showed you some test results of the performance of CoolBasic V3 compiler. The numbers weren’t too appealing when the compiler was fed a huge special kind of a test file. The structure of that certain file caused some compiler bottlenecks to amplify, which resulted in slower compile time than expected. All this got me into thinking of more optimized ways to apply lexical analysis, parsing, analyzing and code transformation. In my previous blogging I introduced some visions of mine in how to fix those issues, and the modifications to the lexer are now in place.

The lexer is executed in the beginning of the compilation process. It’s responsible for identifying tokens from the file stream and then saving the tokens to a linked list for later processing. This new implementation also uses the new fast FileReader, and it features full widechar conversion and Unicode support. The lexer also now utilizes the new GlobalMemoryCache system and super-fast linked list system. In addition, the compiler now runs in both 32-bit and 64-bit mode. So, lots of new technologies have already been implemented, and the base to build the parser and analyzer upon, is eventually materializing. It wasn’t easy to come through, though.

My first idea was to use hand-written Assembly code in order to jump between different lexer states without actually needing to test any conditions for it. Basically the FileReader just called asm jmp to certain label address that was dynamically stored in a variable. It was faster than a procedure call, and certainly faster than a procedure call + set of conditions to test the lexer state against, but I found out the hard way that this wasn’t the most secure way of controlling the program flow: I experienced some odd side-effects I couldn’t explain what caused them, so I eventually threw the Assembly code out of the window, and did it in the “right” ™ way. In addition, in 64-bit mode you’d have to use the qword modifier in order to perform the asm jmp anyway, which for some reason, is slower than in 32-bit mode – in fact it would’ve been a bit slower than a standard procedure call.

During my “happy” Assembly experimentation I experienced very slow compilation (probably lost some hair and/or beard because of it). First I thought it was because of my bad asm-cookings, but the problem persisted even after I removed all hand-written Assembly from the source. After lots of digging and peeking, I finally found the cause: It was just one little procedure that gets called only once per line, but which took 0.8 milliseconds to complete every time. For big test files (11,000 lines+) this adds up. It took me countless of hours and lots of cursing, but when this little issue had been identified and properly handled, the lexer had finally reborn.

The lexer is now indeed very quick – I’d say I achieved my goal. The difference compared to the previous V3 lexer (which also was quite fast for “normal” code i.e. a code somebody would actually write and not artificially created and absurdly large test files) is notable: The new lexer outperforms the previous V3 lexer by up to 6X! The performance was measured with 5 artificially generated, and relatively large, test files as follows:

File size: 4.5Mb
Physical code lines: 11,000
Average line length: 1,100 characters
Consists of keywords, identfiers (6-10 charactes each), integral and decimal numbers (5-7 characters each), parentheses and wide range of different operators
1.6 million tokens
File read + full lexical analysis: 0.36 seconds on average

File size: 12.2Mb
Physical code lines: 11,000
Line length: 1,137 characters
Consists of identfiers (7 charactes each)
1.6 million tokens
File read + full lexical analysis: 1.54 seconds on average

File size: 12.5Mb
Physical code lines: 11,000
Line length: 1,163 characters
Consists of integer numbers alone (6 digits each)
1.8 million tokens
File read + full lexical analysis: 0.31 seconds on average

File size: 14.3Mb
Physical code lines: 11,000
Line length: 1,329 characters
Consists of double decimal numbers alone (6 digits + decimal separator each)
1.8 million tokens
File read + full lexical analysis: 0.39 seconds on average

File size: 12.1Mb
Physical code lines: 11,000
Average line length: 1,150 characters
Consists of random language keywords (83 different kinds of)
1.4 million tokens
File read + full lexical analysis: 0.87 seconds on average

Even though most processing is done in memory, the lexer doesn’t read several megabytes long files completely into memory. This keeps memory pressure in manageable boundaries.

And in the end…
It’s now possible to use the underscore as a line continuation character, thus the following code is considered to be one logical line:

Function MyLongFunc( _
    parameter1 As Integer, _
    parameter2 As Integer, _
    parameter3 As Integer, _
    parameter4 As Integer, _
    parameter5 As Integer _
)

Improved Compiler Performance

Hello again! After 3 weeks of hard writing, my thesis is nearly finished. It only lacks a few edit rounds, so I can now eventually return back to work with CoolBasic V3 compiler. I’m sorry to say this, but I still have lots of work to do until I can start thinking of taking any actions to make the compiler public to anyone – including the DevTeam. As always in software engineering, surprising things tend to pop up, and almost every time they result in extended development time and higher cost. Fortunately I’m not under the pressure of any deadlines nor do I have to pay anything, but this is going to take considerably more time as originally intended. Let me explain why.

Before my break, I conducted some pressure tests on the compiler. I artificially generated a huge source code consisting of nearly 10,000 symbols and close to 2 million tokens. The average line length was around 1,000 characters. Those lines contained overly complicated expressions no-one would ever write anyway, but this kind of a code should be a good benchmark to test the true performance of the compiler. Now, even though CoolBasic V3 compiler is quite fast, this particular structure of the test program caused some of the compiler’s characteristics to exaggerate, causing cumulative increase in compile time.

The huge test program compiled in 12 seconds. Needless to say I was quite surprised. I knew this code could take a few moments to compile, but to be honest, I didn’t expect to see it taking that long. I found this intriguing, and I quickly ported the exact same code over to VB.NET, and ran the VB.NET compiler manually from command line. The process completed in 2 seconds, meaning CoolBasic was 6 times slower than VB.NET compiler. I’m too much of a perfectionist to let this be, because I know I can do better than that.

At this point I’d like to emphasize that “normal code” i.e. natural mix of “normal length” code lines, empty lines and indentations as well as wider diversity of different kinds of statements would probably bring CoolBasic compiler closer to VB.NET compiler. However, if I come to know that there’s a potential problem, I will fix it – even though nobody would ever write this kind of an absurd code. That’s indication of quality. The fact alone that something awful is possible, troubles me. Now, someone could think that this is all big waste of time, but the truth is that this is one of those things you’d better fix as early as possible because *if* somebody someday founds it, then you’d be screwed.

So I started to map which parts of the compilation process took longest to complete. The results amazed me: Those 12 seconds broke down to three sections: lexing (3.0 seconds), parsing (0.9 seconds), and code analyzing & transformation (8.5 seconds). I was mostly surprised by the lexer that munched one quarter of the entire process. After further investigation it occurred to me that the root of all evil was the file reading process (now what’d ya know!) I never saw that one coming. The compiler reads one line at a time from the source file, and then passes it to the lexer. OK, I was aware that this might not be the fastest way possible, but I didn’t think it’d make much difference.

The lexer needs to scan every character individually, utilizing the Mid-function of PureBasic. Now it appears to be that in case of exceptionally long strings (like 1000+ characters we have here) the Mid-function is underperforming. I wrote my own in assembly, and managed to squeeze some extra speed, but the lexing process still took, in my opinion, too long. Eventually I found out that the other string functions of PureBasic, were too slow for the job as well. So I started brainstorming alternative ways of implementing the file reader and lexer.

Another bitter surprise was when I found out that PureBasic’s in-built linked list support (something that CoolBasic V3 compiler makes great use of) is not fully optimized for high performance – at least not for my needs. During those 2 million tokens, there will be hundreds of millions of requests, iteration, deletions, and additions to those lists which ultimately add up resulting in slowish compilation for very lengthy expressions (such as in the test program).

Sure thing there are some things I could have done a bit differently that would’ve helped to some extent, but in the end not much. I’m a little upset of all this because for the most part it’s not my fault. I learned an important lesson though: you want it right, you do it yourself! While I was writing my thesis, I hatched a few great ideas I believe will make a great difference. Perhaps those ideas will bring CoolBasic V3 compiler very close to VB.NET compiler (that has now become the goal). Below I have listed some of the actions I’m taking in order to substantially improve the performance of the CoolBasic V3 compiler (and I think they’re well worth the extra time it takes to implement them):

Full x64 support
To be honest, better to implement this now. Or otherwise I would have to do it later anyways, which would probably be too difficult a task and most likely cause the entire compiler to be re-written from the scratch. This in turn would probably yield a good amount of new bugs rendering end-users not so happy. 64-bit architecture is quite exciting a feature although the runtime engine won’t probably support it immediately. As opposite from what I started the CoolBasic project with, I now have a 64-bit Windows Vista at my disposal, making it possible to develop and run 64-bit code.

Full Unicode Support
This is another thing that needs to be taken into account at a very early stage in development. Not only Unicode support enables the compiler to read ANSI, UTF8, and UTF16 (both Little Endian and Big Endian) encoded files, this will also ensure that all pre-compiled modules are compatible with each other. This also means that strings in CoolBasic V3 take up 2 bytes for each character. Also, it’s now finally possible to use virtually any localized character in source file identifiers, comments and so on.

Hi-performance FileReader
The huge test file was 4.5 megabytes big. On my laptop, reading every line, and then individually scanning each character, took 3.4 seconds in total. There was no further processing to read characters, which makes the elapsed time so alarming. However, the new file reader processed the exact same task in 0.04 seconds, resulting in 85x better performance – and that’s with additional widechar conversion. Also, this time there will be no string operations during the entire V3 compilation process. Even string comparison during name resolution is done via hi-performance smart selection algorithm and memory API.

Hi-performance Lexer
With the change to FileReader, also the lexer needs to be re-written. It will now be based on state information rather than an algorithm similar to regular expressions that was used before. I’m injecting manual assembly in order to jump directly to different token handlers. Obviously, this is breaking a number of “best practices”, but the speed gain is notable enough to justify. The lexer also features a dynamically reallocatable text buffer which makes identifier, keyword, and string building very fast. Some additional testing still needs to be done, but overall I expect a prominent speed gain over the previous lexer.

Hi-performance TokenList
While I was brainstorming, I suddenly got this awesome idea of implementing a linked list of my own as a replacement over PureBasic’s slowish implementation. Token list is one of the most used entities during the compilation, and its speed is vital in order to perform well with large projects (such as complex games). I was eager to test my theory, so I wrote a bunch of linked list related functions, basically forming entire base of a linked list library. Its features are much more diverse and precise when compared to linked lists in PureBasic, offering functions such as AddAfter, AddBefore, AddFirst, AddLast, and RemoveRange.

The test program had a linked list of 2,000,000 token elements, so the estimated memory usage was 46 megabytes for 32-bit version. During the test, 700,000,000 element additions and deletions were made in total, targeting various locations within the list – some at the beginning, some in the middle, and some at the end. Coolbasic V3 list implementation completed the task in 17.0 seconds on average, whereas PureBasic’s equivalent test program completed in 34.3 seconds on average. This means that CoolBasic’s version is more than 2X faster. In addition, memory usage at the peak for CoolBasic V3 compiler was 52 megabytes and 72 megabytes for PureBasic. I was happy.

If we test simply how fast the system walks through the list, and not making any allocations or deallocations there, PureBasic catches up: CoolBasic compiler is now “only” 1.6X faster: Fully iterating 2,000,000 tokens long linked list 1,000 times took 17.1 seconds on average for CoolBasic, and 27.2 seconds on average for PureBasic. The algorithm behind this hi-performance linked list will remain my little secret.

Hi-performance SymbolTable
The second most used entity during the compilation process is with no doubt the symbol table; there can be millions of queries to it. I admit I could’ve designed its iteration more powerful in the first place, but the symbol table is also linked list based, and thus doomed to underperform a task that I know can be made a lot faster. Therefore, using the new linked list technology described earlier, I can now not only gain speed for regular manipulation of the list, but also do some things that were simply not possible before. For example, I no longer have to change the list pointer every time I wish to add an element at the end of the list.

I’m also going to re-design the structure of the symbol table. Although consisting of a single linked list, each node can now store detailed information about its children as well as pre-determined pointers to next and previous node of the same parent node. This technique ensures that only those elements that need to be examined, will walk through iteration, which will obviously result in swift table lookups – combined to new hi-performance resolvers. The new structure isn’t fully taken its final shape yet, but eliminating big part of search time every time an identifier is confronted ought to have a great impact on performance in general.

Hi-performance GlobalMemoryCache
Manual memory allocation is a crucial part of the V3 compiler. This process can be improved by packing the data tight in larger memory “pages”. Actual allocations occur more seldom, and the stored information is easier to transfer and cross-link between modules. Also, when a compile job finishes, it’s easier for me to deallocate everything with a single call rather than iterating something through, and then deallocating pieces here and there when necessary.

Hi-performance Resolvers
Currently, CoolBasic V3 name resolution differs from VB.NET/C# name resolution a little bit. It was a conscious decision to allow the resolving process to continue even though a non-accessible branch was found because there could be more accessible option presented within the upper levels. Due to this design, the resolvers were implemented as recursive functions. They needed to pass information between each calls in a linked list, which has proved to be inefficient in PureBasic. However, for the sake of performance, this no longer seems like a good solution, given that VB.NET and C# operate differently in this case. So, I’m probably going to change this technique, which should relieve some burden from this core compiler functionality. While doing so, I hope I can combine the type resolver and general identifier resolver into one, more manageable entity. By this change I will also gain a number of other straits as well, but I don’t wish to turn this blog post into any more technical babble than it already is.

And in the end…
Needless to say that these are all pretty big changes, and yes, the compiler needs a major re-write to implement all those things. The current V3 compiler is by no means bad, and I’m very proud of it. It’s just that if things could be done better, it’s too tempting to let them be as they are – especially since CoolBasic V3 isn’t out yet. Utilizing the knowledge I’ve gained during this journey, the new core shouldn’t take anywhere as long to complete as the original V3 compiler did. I, too, wish to get V3 out some day, you know 🙂 I think these changes are necessary, and are well worth the extra time.

Moral of the story? Low level programming for performance is good, high level programming for performance is bad. I’m really sorry about the delay since I couldn’t have known about these weaknesses of PureBasic until it was too late. You want it right, you do it yourself!

The compiler FrontEnd is now finished!

Hello again! As I promised last time, I have two big announcements to make. Even though summer has been warm and beautiful, I’ve been working very hard on CoolBasic compiler for the past few weeks. We’re talking about something like 8 hours aday on average, but there were also a few of those days during which I coded 11 hours pretty much non-stop. And it’s been great! Even though most of the work has been nothing but bug fixes, my motivation has all but diminished. Some people might think otherwise, but I actually get encouraged by intriguing bugs that take 24 hours to track down because it’s very rewarding to ultimately fix those.

I don’t normally mention about bugs I’ve found and fixed, but be it exception this time: I had a medium-sized architectural issue regarding method invocation and how the reference is passed to them. Static methods and structure methods add an extra twist. The original implementation had fixed stack position for the reference – which ultimately proved to not work out for complex dot notations that involve static members in the middle. I fixed the problem by re-writing the part that is responsible for recognizing valid references and injecting them into the final code. References are now always at the same relative location they appear in the original statement. Even static methods now have them although those don’t necessarily point to anything. Long and complex dot notation paths in general have been giving me quite a lot of work recently.

The first announcement
I’m proud to announce that I have now finished the Intermediate Language (later referred to as “IL”) entirely. This effectively concludes Pass#2 and thus the compiler FrontEnd is now considered “finished”. It’s been of a long road, and reaching this major milestone is a great relief to me. What this means in terms of progress? The IL still needs yet another transformation, but this time it’ll be final. So basically there’s just one process to make for the compiler, and then I can start working on the runtime. Also, it’s July already, and I think the devTeam won’t be assembled this summer but in autumn perhaps. There just is so much work to do, sorry. I’m about a month late of the original schedule I had. But I’d say it’s normal because in program of this nature surprises tend to emerge. Some of which force medium-to-large re-writes of existing code. Modifications always affect somewhere else and thus extensive testing needs to be done after architectural changes – be it minor or major. Thankfully, I didn’t have to implement major changes to anything (atleast not just yet).

I spent 5 days (there were quite many bug discoveries and fixes involved) to write an extensive array of Unit Tests. A Unit Test is basically a standalone fragment of code that responds to input and output; they are used to test program functionality isolated from the big picture, and each test runs in its own sandbox. These small programs I wrote, test all CoolBasic programming language features. After compilation I examined the emitted IL in detail and ensured that everything went well and that the IL was correct. As a result of this process, I gathered interesting data and statistics I’d like to share with you. I think the numbers speak for themselves:

Test Program:

– Code size: 2547 lines in 7 files, 62 kb in total
– Composition: natural mix of comments, declarations and executing code. All language features
– Symbol table size: 1150 identifiers
– Emitted IL: 7410 lines

Testing Platform:

– HP HDX18-1000EO (Laptop)
– T9400 processor
– 4GB memory
– Windows Vista (64-bit)

Results:

– Number of compilations: 20
– Average compile time: 105 milliseconds
– Fastest compile time: 97 milliseconds

That’s pretty fast compilation considering the complexity of the process. Overall I’m happy with the outcome and that “pressure test”. I’ll be using the same gigantic Unit Test program to validate everything when any new feature gets added later on.

The second announcement
Now that the pressure is mostly gone, there’s something completely different I need to do: I still haven’t finished my thesis, so I’m technically still a student. That’s something I’ll soon need to take care of as the official time limit of graduation is closing to its end (and I don’t wish to extend it). As some of you may know, I already work as professional software engineer, so that’s taken some time and focus off the studies. The only thing remaining (and the only reason I’m still in their register as “pending graduation”) is the unfinished thesis. So I’ll take a few weeks off CoolBasic, and finish up my studies in Bachelor of Business Administration in Information Technology. Graduation and receival of the official “shiny paper” will also be a great relief, and after that I can fully focus on CoolBasic once again. I will start the thesis from a blank table very soon. During this time, however, I will be watching the forums, and am available via IRC (nothing new there). My four-week summer holiday is starting on 20th July so I have lots of resources to use on the thesis (although I don’t wish to spend all the summer indoors). I will also be participating in Assembly Summer 2009 this year!

Some random thoughts
I reviewed CoolBasic V3 manual. After careful consideration, I came to conclusion that the black theme is unfitting to such material that is to become the primary textual source of information for Coolbasic V3. Although silver text is still very readable on black background, you can actually read dark text on white background longer without the eyes growing tired. I’ve taken a little more MSDN-like approach although I plan to enhance the visual look and feel.

It’d be cool to allow the online manual to be somewhat interactive and more dynamic. Features such as User Comments, Version History, Multilinguality, and linking to outside sources as well as flexible Maintainability will basically require a database. Thus, online version of the manual is best to separate from offline version. I’d still like to use XSLT for the offline manual, but then the end users couldn’t use Firefox to view it – or otherwise I’d have to make the manual very dull looking basically ignoring text styling within paragraphs. So there’s still some open questions to the offline version, but database approach for the online manual certainly looks quite promising. Either way, it’s always possible to generate the offline manual directly from the database. Database design, however, will be a challenge!

Enjoy the summer, folks!

CoolBasic Intermediate Language

Perhaps the title of the previous blog entry could’ve been better thought out. For some reason, I keep reading it wrong “IL logic is gone”, and I’m not sure if I’m the only one 🙂 . In addition, “program logic” is a term that’s basically the less technical alternative to “algorithm”. In the previous post, I was referring to how the execution bounces between “labels” generated by program flow control structures such as conditions and loops. Anyways, for today’s post I also have news about Intermediate Language, only this time I actually mean how separate lines are executed expression-wise (which also can be referred to as “logic”).

I have now almost finished those necessary functions that are responsible for generating the high-level instructions from each line of the original source code. This is essentially the base for “Byte code” that is used by runtime interpreters. The order of tokens, in CoolBasic Intermediate Language, is fixed so that they can be processed and calculated very efficiently at runtime. You can learn more about postfix notation in this wiki article.

The following CoolBasic code:

Module M
	Function R()
		Dim i As Integer
		i *= i + (6 / i) - i * 2
	EndFunction
EndModule

… will compile into:

.method public shared m.r ()
{
.size 4
IL_000002: ldloc     m.r.i
IL_000003: ldloc     m.r.i
IL_000004: ldc.i4    6
IL_000005: ldloc     m.r.i
IL_000006: div       
IL_000007: add       
IL_000008: ldloc     m.r.i
IL_000009: ldc.i4    1
IL_000010: shl       
IL_000011: sub       
IL_000012: mul       
IL_000013: stloc     m.r.i
IL_000014: ldc.i4    0
           IL_prog_6_end:
IL_000016: ret       
}

Operator data type pairs don’t show there, but those are to be determined later. Please also notice that this is not a complete program. I merely wanted to give you a demonstration that CoolBasic Intermediate Language generation is quite mature already. You can also see some code optimization in this example. However, there’s much more into it than simply performing binary shifts instead of multiplication or division in certain situations. The CoolBasic compiler is able to generate the equivalent IL from almost any language feature now.

IL generation from expressions also means that most if not all of the error messages (and so error checks) are done, and that every statement has reached their final design. During this time I have fixed quite a lot of bugs, none of which has proved to be overwhelming enough to compel me to come up with a work-around. There’s also been some bigger challenges such as property behavior with the direct assign operators (+=, -= etc), field access with long dot notation paths, constructor calls, and IL generation for special operators (to be revealed later). Minor and small problems usually take only minutes to solve, but bigger ones can take up several hours until I come up with a solution of some sort. Looking back, those “difficult” problems have solved quite nicely, and now that I think of all the benefits I’ve gained thanks to them, there’s no gum code into it after all.

In addition to technical coding, I’ve also established the new www-portal framework I’m experimenting with. CoolBasic web page will feature lots of dynamic content in the future, maybe even some kind of a publishing interface for admins. I’m also thinking of ways to bind this dynamic content to the development environment. As a small side-note I’ve also spent some time on new icon candidates for CoolBasic files, and at least the Vista versions look quite nice already.

The next blog entry will in all probability be just an announcement of two things: for one, of me having finished IL generation entirely. About the second I don’t want to talk about yet.

IL logic is done

It’s about three weeks since my last blogging. Sorry for that, I know some of you pay big interest to my doings and how CoolBasic develops. That said, there’s a good reason why I’ve been so silent: I have been implementing some exciting new features and I didn’t want to write a wrap-up until all of them were fully done and things are actually working. As Pass#2 is closing to its end, every statement needs to be fully operational so that they can be transformed into IL. This also means that I had to define the missing constructors and other required methods, and implement special behaviour to where it’s needed. For example, strings and arrays tend to require some exceptional processing, mind you all of which is invisible to the user. Both arrays and strings work like a class internally, but syntax in BASIC-languages dictate simplier and cleaner standards to express them (as what comes to creating, initializing and using them). In addition, strings are fully managed, arrays are not. But more about those topics in future coming blog entries.

CoolBasic compiler now being at a stage where the IL is being generated basically means that every statement has now undergone their final parses. Remember when I talked about parsing and analyzing back in Pass#1? Well, this “parsing” continues also after the resolvers have been executed. Pass#1 merely certified that those lines were syntactically mostly correct and eliminated lots of extra work for Pass#2. Now that all possible data we need has been gathered and analyzed by earlier sub-processes of Pass#2 (and the rest of all syntax checks have been performed), all statements get their vital parts being isolated and ultimately converted into Intermediate Language. I just finished the last remaining IL logic parser which means that it’s time to focus on the actual IL. By “IL logic” I mean the way program flow control stuctures such as If…EndIf, For…Next, and Repeat…Until form their final “jump there – do that” labels and expressions. There’s still work to be done in order to finish IL creation.

Basically, expressions still need to be broken into simple instructions like stack push/pop and mathematical operator calls. And that’s the next task to do – finishing up the IL generation. This also includes the creation of class initializator and finalizator methods as well as generating the global and local variable spaces for classes. I hope that the compiler frontend is going to be finished at early summer. When that is done, I can start focusing on higher level CoolBasic language services such as linked lists of various sorts. My plan is to provide “somewhat complete” version for the closed alpha (to be announced at a later time).

So… lots of cool things have been added to the most recent development build of the compiler, lots of fixes and improvements have been made, and also lots of testing have been conducted as a side product of all these. Overall, everyting is looking bright at the moment, and no major problems have occurred so far. I’d like to conclude this blog post to the following code snippet:

Dim k As Integer = 5

If 1 + 2

	Dim i As Integer // This is a local variable that is only visible in *this* scope
	Dim k As Integer // Error! Local variable hides another local variable

EndIf

If k

	Dim i As Integer // This is completely different variable "i"

EndIf

i = 10 // Error! Variable "i" is not declared in this scope

The plan to complete Pass#2

Every now and then (usually after something bigger gets finally implemented) comes the time to perform a “maintenance round” to clean up and finish the latest added features. That’s essentially what I’ve been doing lately. I want to keep it clean and tidy. I also thought that now would be the appropriate time to transform the compiler into what it’s intended to be in the end: a working console application. Until now I have had the input file hard-coded as well as the function call, to perform the actual compilation. Well, not any more. The compiler now gets called by commandline which defines all input files and/or additional options. All passed files then get parsed and compiled nicely one by one. The compiler will not be a DLL (as hinted in the previous blog entry) nor will there be any visible Windows form. There will not be a progress bar. Instead, the compiler does write some info to the Standard Output Stream – which can be directed straight to the future-coming CoolBasic editor. Possible compile time errors can be parsed from this string data. In addition, the compiler returns an exit code back to the system which tells was the compilation successfull or did it fail. Overall, this resembles how Visual Studio performs building.

I also took another round of standardizing the error messages (mainly those that get generated by Pass#2). Currently, there’s a great deal of them, and I estimate there to be a few hundred of them in total by the time the compiler will be finished. And I must say… Oh boy I’m going to be loosing some hair in an attempt of localizing them into finnish. Not because of the amount but because of their nature. The quantity of the finnish literature in modern computer programming is way below the levels when compared to other languages. Most books about programming are nowadays almost always in english which means that the finnish terminology is in danger of not being able to develop accordingly to the rules set by this industry. The impact is already very visible, and personally I’m going to be a little worried about the future-coming finnish localization… I think I will need to simply come up with some new terminology translations because there simply aren’t any established “official” words for their original english counterparts. This means that most translated terms sound silly because everyone is already used to their english meaning. There are some error messages that I already find difficult to properly translate into finnish. For now, I haven’t yet implemented the finnish translations although the support is there: someone would just have to fill in the missing strings. Well, maybe I’ll just delegate the job to someone in the devTeam later 🙂

I have finished the “maintenance round” now, and the compiler application is now operating well. I even tested running the compiler from within a .NET application, to receive its stream output. Please note that the compiler is still missing a lot of its functionality, and is not by any means, finished; It’s not yet able to produce the IL from the source code – which is pretty much the only thing remaining. Yes, the end of the tunnel is already visible, but there’s still a long walk before we reach it. I freshly added a new functionality to Pass#1 to gather detailed list of code lines and the exact method they belong to. Actually, I already had that data, but this modified collecting process gets the job done better. I now have a list of “executable code tokens” that need to be analysed for the final time, only this time we can be sure that their syntax is already OK. Practically, we’re cleaning it up, adding the needed metainfo, and dismantling it into fundamental theorems. Think about the For statement… it’s actually a While...EndWhile block.

So there’s only two major things remaining… the IL and the compiler Backend. Wow.

Pass#2 half finished

I’ve been working hard on CoolBasic V3 for the past few months. I have a regular job (as a programmer), so development time limits to weekends, evenings, and nights. Basically, I spend around 4-5 hours on average developing the CoolBasic Compiler every day, 8-12 hours aday during weekends and hallows. That being said, I thought that some vacation coudn’t possible be bad at all… So I kept 4 “days off” during the Easter, trying to clear my mind. But still I’m getting the feeling that I’m slacking. I finished the Expression Processor the day after I returned. As you may recall, I characterized this entity as for being the most complex procedure of the entire compiler so far. I’m very pleased that this enormous job is finally done. A few small things still remain to be taken care of, like handling access to Shared members and verifying legal type casting, but the outlines are now solid and the system seems to be working with the current (fairly sophisticated) algorithm.

There’s also something special worth mentioning: While I was creating the latest back-up copy, I noticed that there’s now exactly 1 Megabyte of pure source code. That means roughly 1 million key strokes! To visualize, count to 100 in your mind as if you were typing relatively fast. When you’re done, imagine to do it again for 10,000 more times! Yep, that’s a lot of code! Now, one could think that this can be explained with extensive copy-pasting, but that is not the case; in fact, the compiler architecture is designed to make extensive use of functional approach, meaning that practically speaking all of the procedures are so specialized that they don’t recycle code at all.

To illustrate, remember when I said that the type resolver and the identifier resolver would look alike and function similarly? That was the assumption I had before I actually got started with the Expression Processor. Well that assumption is no more, I turned things pretty much upside-down (there’s a lot more into it for the Expression Processor when compared to the Type Resolver where certain assumptions about the expression’s structure can be made). To sum up, Type Resolver is now fully operational, Name Resolution is now fully operational, Overload Resolution is now fully operational, Constant Value pre-calculator is now fully operational, and semantic checks are nearly finished. I’m also testing an idea of my own, to solve the grouping issue explained here.

EXE or DLL?
One day I was pondering which way would be better. The original approach was to make the compiler a Dynamic Linked Library, but bundle it with a Commandline Compiler (which would ultimately call the DLL anyways). If one wanted to create, say, an own IDE – or just access the compiler without CoolBasic being fully installed, there would be 2 choises depending on the situation and personal preference. However, a DLL cannot write to Standard Output Stream (well, this is actually not true, but there’s another reason why it’s a bad idea to try doing that), so compiling progress would have to either be completely silent, or the compiler would have to create a window equipped with a progression bar or at least a label with a description. Also, what *if* the compiler would crash (which of course doesn’t happen)… the entire IDE would crash, too. On the other hand, A DLL function could return a string with embedded metainfo about the end result whereas an executable can only return an error code back to the system. And then there’s the unpredictable memory and resource management of an attached DLL. Even though the CoolBasic Compiler is very well designed to free all resources after compilation, I’m still hesitating. A safe executable just ends, freeing all reserved memory 100% guaranteed. Going always through the Commandline Compiler would overcome that problem, but what’s the point of a DLL then in the first place?

As it stands now, I think I’m going to end up with Commandline Compiler only (which the IDE will silently call, meaning no visible console window will be created). This way the compiler can message back straight to the IDE, as well as provide with a log file for more details. It will also operate in an independent manner and free all its resources 100% sure. Possible OS porting will be easier in the future, too. There may be some priviledge issues involved, but let’s examine those more closely when the matter becomes more relevant.

Closing words
I’ll tweak the Expression Processor some more by fine-tuning its functionality and fixing any remaining bugs. While anything can always suddenly arise, at this point I’m confident that I can soon start thinking of the next steps. That would be something like processing all those code lines that execute program logic, and to generate appropriate IL for them. I’d say we’re half-a-way into Pass#2 now (and yet Pass#2 has more code than Pass#1 has).

Pass#2 Semantic checks

Data type resolution works now in all cases (I’ve extended its capabilities since I last mentioned about it). A quick wrap-up on Pass#2 would go as follows: Before anything can be calculated – including constant expressions – the compiler will first need to resolve the data types of all identifiers. In practice, this applies to variables, constants and procedures. In the other hand, in order to make type resolution fully operational, all alias names need to be resolved beforehand. A “type” here refers to a structure or a class. Also, “name resolution” in general refers to both type resolution and normal identifier (that generates a value) resolution. These two resolvers greatly differ from each other, thus I need to implement them both in such way that they operate independently from each other – excluding the fact that the identifier resolver may call data type resolver when needed. Anyways, the type resolver must be implemented at this point whereas adding value-generating identifier resolution just isn’t possible yet.

One could wonder why can’t I start working on normal identifier name resolution now that type references have been resolved. In OOP there’s a lot semantic checks that need to be done before resolving identifier references because otherwise you would get possibly misleading errors during compile time. Those semantic checks couldn’t be done in Pass#1 because data types weren’t resolved yet. The checks include base class checks, inheritance checks, override checks, and some special rules – only to name a few. Basically the compiler iterates the symbol table here, and performs a symbol-specific set of checks. All of these checks are not mandatory in order to produce a working executable, but they’ll ensure that the source code is overall valid and semantically correct. For example, a semantic error within a base class may not occur if name resolution returned a valid reference before that.

I’m going to remove part of the ambiguous name checks in Pass#1. Procedure signatures were formed in Pass#1, and they were used to check for identical overloads. Unfortunately, qualified names can bypass these error checks, so the signatures need to be checked again after type resolution. I could let those checks be as they are, but I’d like to shift these checks over to Pass#2 semantic validator so that they appear in one place only. With small adjustments to the type resolution, I think I can cover overloads much better overall. Overloads don’t apply to functions alone anyways.

I think I can finally get started with the most complicated sub-routine of this entire compiler (Expression Processor) once I finish with these semantic checks. As a side-note, I estimate CoolBasic V3 compiler is way past the halfway now. First things first, though – let’s get these semantics over with.

Compile time error messages

It’s been two and a half weeks since the last entry. It may seem a long time without any updates, but I’ve been working very hard on Pass#2 during this whole time. I finally managed to establish the architecture for data type resolving and name resolution. I just didn’t want to express any thoughts on this highly experimental phase until the solution that actually works on all test cases, has been found. That being said, I’ve spent the last few days on small improvements and bug fixing. Also, this is a good spot to clean up the code and fine-tune the work done so far until I get started with the next big thing (which I won’t elaborate at this point). One thing that needs improving and what has been haunting me for some time now, is the way the compiler expresses compile time errors to the user. The coding has been so hectic from the start that even though the compiler is able to express all the needed errors, some rephrasing is surely needed. Especially for Pass#1 which features a lot of syntax checks, and thus a lot of similarly formatted error messages.

Current V3 compiler error messages
For reference, compilation stops to the first error occurred – to prevent further (possibly misleading) errors from a chain-reaction the first error might have caused. This is identical to how current CoolBasic and many other Basic language compilers report errors. If you have programmed in C and the related languages, you might be familiar with the error flood you get when you try to build a project that had maybe just few small mistakes in the source code. In most cases, the tail of that list consisted of completely mysterious errors, and only because of the preceding errors caused flawed compiling process to proceed until the compiler got too confused to be able to continue. Luckily, compilers have become more intelligent regarding this matter, and for example, Visual Studio is able to construct a list with only the essential errors in it. CoolBasic V3, however, will continue to terminate the entire compilation on first error because of safety and stability. The sooner you get back to fixing it, the better.

Back to the current V3 implementation. To save time, errors are returned as strings when they occurr. I didn’t want to gather the absolute list of Pass#1 errors beforehand because it would have changed anyways. This was fully intended at that point in development. However, now that Pass#1 is essentially finished, I know all the errors that can occurr within it. And there’s now the need of rephrasing which means I would have to Find & Replace them all. Instead, it’s better to standardize the messages by binding them to constants; rephrasing them would require the change to be made to one place only. This need was also foreseen, but now it’s time to implement it. I would have changed the error message generation anyway, but this is also a great opportunity to think over better ways to express errors in terms of how they are phrased; Currently there exists a few vague messages that don’t provide enough information. For example, what does “Statement out of context” tell to the programmer?

The ideal error message
A good error message is uniformal, informative and precise. It should also provide a hint on how to correct the error. That way the programmer doesn’t have to visit the manual or spend time to understand what the error could mean in that particular case – which improves productivity and overall relish. However, being able to tell the error line, code module, the cause and the possible error code for further reference, is not enough. What if you had separarate statements written on the same line (separated with : of course), and you got an error saying “Error at line bla bla in code module bla bla: This statement cannot appear here.” More unnecessary time trying to figure out which statement caused the error. From my point of view it wouldn’t be too much of a work to bother adding this information to the error message, but omitting it would surely add up when hundreds of end users write programs in CoolBasic every day. Picture this: “Declaration context of ‘Const’ is invalid.” Sounds better? Sure, but something is still missing.

Normally I would add a hint on how to correct this error by specifying where constants are allowed or where they cannot appear. In this particular case it’s not that simple, because even though constants can appear within procedures and blocks, they can’t appear within Property or Select statement bodies. Rephrasing this error to cover exact definition where a constant can be declared would result in too complicated error message. Uniformality excludes me from listing the legal declaration contexts as well because their exact contents can vary depending on the statement. Compromises do happen.

Precise error messages also tell meta information about the error. For example, a vague error message like “Identifier is already defined.” doesn’t really tell which identifier is causing the conflict. A better version could be “‘myVariable’ is already declared in its context”, but it can be improved even further, providing more narrowing information: “Variable ‘myVariable’ is already declared in this Class.” – or even by providing the full declaration information, including the data type definition.

Whereas “precise” refers to “contextual detail”, informative errors specify moreover what’s syntactically or semantically wrong. For example, instead of telling “Syntax error. Invalid statement.” you could specify “End of statement expected.”

Detailed error messages tend to increase their amount. That generates variance which easily gets out of hand. Thus, even though you have practically different error messages for telling how different kinds of identifiers conflict in context, or what token statements expect next, they should follow the same basic formatting. This will not only result in impression of quality to the end user, but also makes the developer’s life easier in the future. Uniformality also emerges within the error codes. Even though variables and constants generate slightly different error messages when, say, their data type could not be resolved, they still share the same error code because the error occurred from the same cause. In other words: Error codes are not tied to the actual (unique) description. Something I need to keep in mind…

Showing the error to the user
The current CoolBasic version shows errors inside of a modal window with the code and description in it. Because of its modal nature, you must close the window in order to edit your code. And not only that, you can’t do anything while the windows is there so you *need* to close it. Does anyone else sometimes forget what the actual error message was, while you’re thinking and fixing? Well, I do, and it’s irritating when it happens because I will then probably compile it again to see it. Also, I don’t want to hear the Windows’ exclamation sound every time the compilation fails. In addition, there’s a known issue of Scintilla sometimes invalidating its rendering area when a modal window pops up basically making the source code invisible for quick inspection while the error window is present. So I’m experimenting with an idea to leave the latest error message visible to the bottom section of the editor, in a similar way than Visual Studio does. Also, when you click on that section, you could open the manual page for more details. You could even filter that section to show full error history or the latest error only. The error message should also be easy-to-read meaning that the error module and line number should appear in, say, inside their own respective columns.

The principle is that errors should be silent, but noticeable. And they should not create any unnecessary stir or require actions from the user.

Localize the compiler too?
I have mentioned several times that CoolBasic V3 will be fully localized into both english and finnish. This includes manual, website, forums, editor, tools, examples, tutorials and all of the content in general. There has been one hesitation to the rule, however. Normally compilers don’t get localized. It’s just the way it is, and nobody really questions it. Due to absence of real life examples and “significant” products pioneering this concept (To be honest, I don’t know about Visual Studio. I’m too lazy to check it out, but I haven’t heard about such thing), and combined with the amount of the required maintenance or the infrastructural problems within the compiler’s architecture, the idea hasn’t really catched on. Parts of the error messages would be in english anyway because common base class libraries are mostly coded in english, thus their identifier names mix to the localized language, making it look funny.

With all this rephrasing going on, it’s possible for me to build support for both english and finnish error messages. And I have decided to give it a try. You can always disable it, right? However, I will probably use english as the default compiler language even for localized CoolBasic unless separately changed from the editor’s settings.

The changes brought up in this blog entry will keep me busy for a few days…

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress