Hello again! After 3 weeks of hard writing, my thesis is nearly finished. It only lacks a few edit rounds, so I can now eventually return back to work with CoolBasic V3 compiler. I’m sorry to say this, but I still have lots of work to do until I can start thinking of taking any actions to make the compiler public to anyone – including the DevTeam. As always in software engineering, surprising things tend to pop up, and almost every time they result in extended development time and higher cost. Fortunately I’m not under the pressure of any deadlines nor do I have to pay anything, but this is going to take considerably more time as originally intended. Let me explain why.
Before my break, I conducted some pressure tests on the compiler. I artificially generated a huge source code consisting of nearly 10,000 symbols and close to 2 million tokens. The average line length was around 1,000 characters. Those lines contained overly complicated expressions no-one would ever write anyway, but this kind of a code should be a good benchmark to test the true performance of the compiler. Now, even though CoolBasic V3 compiler is quite fast, this particular structure of the test program caused some of the compiler’s characteristics to exaggerate, causing cumulative increase in compile time.
The huge test program compiled in 12 seconds. Needless to say I was quite surprised. I knew this code could take a few moments to compile, but to be honest, I didn’t expect to see it taking that long. I found this intriguing, and I quickly ported the exact same code over to VB.NET, and ran the VB.NET compiler manually from command line. The process completed in 2 seconds, meaning CoolBasic was 6 times slower than VB.NET compiler. I’m too much of a perfectionist to let this be, because I know I can do better than that.
At this point I’d like to emphasize that “normal code” i.e. natural mix of “normal length” code lines, empty lines and indentations as well as wider diversity of different kinds of statements would probably bring CoolBasic compiler closer to VB.NET compiler. However, if I come to know that there’s a potential problem, I will fix it – even though nobody would ever write this kind of an absurd code. That’s indication of quality. The fact alone that something awful is possible, troubles me. Now, someone could think that this is all big waste of time, but the truth is that this is one of those things you’d better fix as early as possible because *if* somebody someday founds it, then you’d be screwed.
So I started to map which parts of the compilation process took longest to complete. The results amazed me: Those 12 seconds broke down to three sections: lexing (3.0 seconds), parsing (0.9 seconds), and code analyzing & transformation (8.5 seconds). I was mostly surprised by the lexer that munched one quarter of the entire process. After further investigation it occurred to me that the root of all evil was the file reading process (now what’d ya know!) I never saw that one coming. The compiler reads one line at a time from the source file, and then passes it to the lexer. OK, I was aware that this might not be the fastest way possible, but I didn’t think it’d make much difference.
The lexer needs to scan every character individually, utilizing the Mid-function of PureBasic. Now it appears to be that in case of exceptionally long strings (like 1000+ characters we have here) the Mid-function is underperforming. I wrote my own in assembly, and managed to squeeze some extra speed, but the lexing process still took, in my opinion, too long. Eventually I found out that the other string functions of PureBasic, were too slow for the job as well. So I started brainstorming alternative ways of implementing the file reader and lexer.
Another bitter surprise was when I found out that PureBasic’s in-built linked list support (something that CoolBasic V3 compiler makes great use of) is not fully optimized for high performance – at least not for my needs. During those 2 million tokens, there will be hundreds of millions of requests, iteration, deletions, and additions to those lists which ultimately add up resulting in slowish compilation for very lengthy expressions (such as in the test program).
Sure thing there are some things I could have done a bit differently that would’ve helped to some extent, but in the end not much. I’m a little upset of all this because for the most part it’s not my fault. I learned an important lesson though: you want it right, you do it yourself! While I was writing my thesis, I hatched a few great ideas I believe will make a great difference. Perhaps those ideas will bring CoolBasic V3 compiler very close to VB.NET compiler (that has now become the goal). Below I have listed some of the actions I’m taking in order to substantially improve the performance of the CoolBasic V3 compiler (and I think they’re well worth the extra time it takes to implement them):
Full x64 support
To be honest, better to implement this now. Or otherwise I would have to do it later anyways, which would probably be too difficult a task and most likely cause the entire compiler to be re-written from the scratch. This in turn would probably yield a good amount of new bugs rendering end-users not so happy. 64-bit architecture is quite exciting a feature although the runtime engine won’t probably support it immediately. As opposite from what I started the CoolBasic project with, I now have a 64-bit Windows Vista at my disposal, making it possible to develop and run 64-bit code.
Full Unicode Support
This is another thing that needs to be taken into account at a very early stage in development. Not only Unicode support enables the compiler to read ANSI, UTF8, and UTF16 (both Little Endian and Big Endian) encoded files, this will also ensure that all pre-compiled modules are compatible with each other. This also means that strings in CoolBasic V3 take up 2 bytes for each character. Also, it’s now finally possible to use virtually any localized character in source file identifiers, comments and so on.
Hi-performance FileReader
The huge test file was 4.5 megabytes big. On my laptop, reading every line, and then individually scanning each character, took 3.4 seconds in total. There was no further processing to read characters, which makes the elapsed time so alarming. However, the new file reader processed the exact same task in 0.04 seconds, resulting in 85x better performance – and that’s with additional widechar conversion. Also, this time there will be no string operations during the entire V3 compilation process. Even string comparison during name resolution is done via hi-performance smart selection algorithm and memory API.
Hi-performance Lexer
With the change to FileReader, also the lexer needs to be re-written. It will now be based on state information rather than an algorithm similar to regular expressions that was used before. I’m injecting manual assembly in order to jump directly to different token handlers. Obviously, this is breaking a number of “best practices”, but the speed gain is notable enough to justify. The lexer also features a dynamically reallocatable text buffer which makes identifier, keyword, and string building very fast. Some additional testing still needs to be done, but overall I expect a prominent speed gain over the previous lexer.
Hi-performance TokenList
While I was brainstorming, I suddenly got this awesome idea of implementing a linked list of my own as a replacement over PureBasic’s slowish implementation. Token list is one of the most used entities during the compilation, and its speed is vital in order to perform well with large projects (such as complex games). I was eager to test my theory, so I wrote a bunch of linked list related functions, basically forming entire base of a linked list library. Its features are much more diverse and precise when compared to linked lists in PureBasic, offering functions such as AddAfter, AddBefore, AddFirst, AddLast, and RemoveRange.
The test program had a linked list of 2,000,000 token elements, so the estimated memory usage was 46 megabytes for 32-bit version. During the test, 700,000,000 element additions and deletions were made in total, targeting various locations within the list – some at the beginning, some in the middle, and some at the end. Coolbasic V3 list implementation completed the task in 17.0 seconds on average, whereas PureBasic’s equivalent test program completed in 34.3 seconds on average. This means that CoolBasic’s version is more than 2X faster. In addition, memory usage at the peak for CoolBasic V3 compiler was 52 megabytes and 72 megabytes for PureBasic. I was happy.
If we test simply how fast the system walks through the list, and not making any allocations or deallocations there, PureBasic catches up: CoolBasic compiler is now “only” 1.6X faster: Fully iterating 2,000,000 tokens long linked list 1,000 times took 17.1 seconds on average for CoolBasic, and 27.2 seconds on average for PureBasic. The algorithm behind this hi-performance linked list will remain my little secret.
Hi-performance SymbolTable
The second most used entity during the compilation process is with no doubt the symbol table; there can be millions of queries to it. I admit I could’ve designed its iteration more powerful in the first place, but the symbol table is also linked list based, and thus doomed to underperform a task that I know can be made a lot faster. Therefore, using the new linked list technology described earlier, I can now not only gain speed for regular manipulation of the list, but also do some things that were simply not possible before. For example, I no longer have to change the list pointer every time I wish to add an element at the end of the list.
I’m also going to re-design the structure of the symbol table. Although consisting of a single linked list, each node can now store detailed information about its children as well as pre-determined pointers to next and previous node of the same parent node. This technique ensures that only those elements that need to be examined, will walk through iteration, which will obviously result in swift table lookups – combined to new hi-performance resolvers. The new structure isn’t fully taken its final shape yet, but eliminating big part of search time every time an identifier is confronted ought to have a great impact on performance in general.
Hi-performance GlobalMemoryCache
Manual memory allocation is a crucial part of the V3 compiler. This process can be improved by packing the data tight in larger memory “pages”. Actual allocations occur more seldom, and the stored information is easier to transfer and cross-link between modules. Also, when a compile job finishes, it’s easier for me to deallocate everything with a single call rather than iterating something through, and then deallocating pieces here and there when necessary.
Hi-performance Resolvers
Currently, CoolBasic V3 name resolution differs from VB.NET/C# name resolution a little bit. It was a conscious decision to allow the resolving process to continue even though a non-accessible branch was found because there could be more accessible option presented within the upper levels. Due to this design, the resolvers were implemented as recursive functions. They needed to pass information between each calls in a linked list, which has proved to be inefficient in PureBasic. However, for the sake of performance, this no longer seems like a good solution, given that VB.NET and C# operate differently in this case. So, I’m probably going to change this technique, which should relieve some burden from this core compiler functionality. While doing so, I hope I can combine the type resolver and general identifier resolver into one, more manageable entity. By this change I will also gain a number of other straits as well, but I don’t wish to turn this blog post into any more technical babble than it already is.
And in the end…
Needless to say that these are all pretty big changes, and yes, the compiler needs a major re-write to implement all those things. The current V3 compiler is by no means bad, and I’m very proud of it. It’s just that if things could be done better, it’s too tempting to let them be as they are – especially since CoolBasic V3 isn’t out yet. Utilizing the knowledge I’ve gained during this journey, the new core shouldn’t take anywhere as long to complete as the original V3 compiler did. I, too, wish to get V3 out some day, you know 🙂 I think these changes are necessary, and are well worth the extra time.
Moral of the story? Low level programming for performance is good, high level programming for performance is bad. I’m really sorry about the delay since I couldn’t have known about these weaknesses of PureBasic until it was too late. You want it right, you do it yourself!