Introducing the new Lexer

Last time I showed you some performance test results of the CoolBasic V3 compiler. The numbers weren’t too appealing when the compiler was fed a huge, specially crafted test file. The structure of that file amplified certain compiler bottlenecks, which resulted in slower compile times than expected. All this got me thinking about more optimized ways to perform lexical analysis, parsing, semantic analysis, and code transformation. In my previous blog post I outlined some ideas for fixing those issues, and the modifications to the lexer are now in place.

The lexer runs at the beginning of the compilation process. It’s responsible for identifying tokens from the file stream and saving them to a linked list for later processing. This new implementation also uses the new, fast FileReader, and it features full widechar conversion and Unicode support. The lexer now utilizes the new GlobalMemoryCache system and a super-fast linked list system as well. In addition, the compiler now runs in both 32-bit and 64-bit mode. So, lots of new technology is already in place, and the base on which to build the parser and analyzer is finally materializing. It wasn’t easy to get here, though.
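To illustrate the idea of turning a character stream into a token list, here is a minimal sketch in Python. The token categories and patterns are purely illustrative assumptions; CoolBasic’s actual lexer is a hand-tuned state machine with its own token set, not a regex scanner.

```python
import re

# Hypothetical token patterns -- a simplified stand-in for the real
# lexer's state machine, not CoolBasic's actual token definitions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(\.\d+)?"),    # integral and decimal numbers
    ("IDENT",  r"[A-Za-z_]\w*"),   # identifiers and keywords
    ("OP",     r"[+\-*/=<>()]"),   # operators and parentheses
    ("SKIP",   r"\s+"),            # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    """Yield (kind, text) pairs; a real lexer would append these to its token list."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(lex("count = count + 42"))
```

In the real compiler each token would be appended to the linked list mentioned above rather than collected into a Python list.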

My first idea was to use hand-written Assembly code to jump between different lexer states without actually testing any conditions. Basically, the FileReader performed an asm jmp to a label address that was dynamically stored in a variable. This was faster than a procedure call, and certainly faster than a procedure call plus a set of conditions to test the lexer state against. However, I found out the hard way that this wasn’t the most secure way of controlling program flow: I experienced odd side effects whose cause I couldn’t explain, so I eventually threw the Assembly code out of the window and did it the “right”™ way. In addition, in 64-bit mode you’d have to use the qword modifier to perform the asm jmp anyway, which, for some reason, is slower than in 32-bit mode; in fact, it would have been slightly slower than a standard procedure call.
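The “right” way can be pictured as storing the *next state handler* in a variable instead of a raw label address. The sketch below shows that pattern in Python; the state names and the string-scanning example are my own invented illustration, not the compiler’s actual states.

```python
# Sketch of state dispatch via a variable holding the next handler
# function -- the safe equivalent of jumping to a dynamically stored
# label address. States here (default/string) are illustrative only.

def state_default(ch, out):
    if ch == '"':
        return state_string          # switch state: following chars belong to a string
    out.append(("char", ch))
    return state_default

def state_string(ch, out):
    if ch == '"':
        return state_default         # closing quote returns to the default state
    out.append(("str", ch))
    return state_string

def run(text):
    out = []
    state = state_default            # current state kept in a variable,
    for ch in text:                  # much like the stored jump target
        state = state(ch, out)
    return out
```

Each handler returns the next handler, so the main loop never tests the state explicitly; it just calls whatever the variable currently points at.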

During my “happy” Assembly experimentation I experienced very slow compilation (and probably lost some hair and/or beard because of it). At first I thought it was caused by my bad asm cookings, but the problem persisted even after I removed all hand-written Assembly from the source. After lots of digging and peeking, I finally found the cause: a single little procedure that gets called only once per line, but that took 0.8 milliseconds to complete every time. For big test files (11,000+ lines) this adds up. It took countless hours and lots of cursing, but once this little issue had been identified and properly handled, the lexer was finally reborn.
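The arithmetic shows why one “cheap” procedure dominated the profile: called once per line at 0.8 ms a call, an 11,000-line file spends nearly nine seconds in it alone.

```python
# Why a once-per-line call dominates: 0.8 ms per call, 11,000 lines.
calls_per_file = 11_000          # physical lines in the big test files
cost_per_call_ms = 0.8           # measured cost of the slow procedure
total_seconds = calls_per_file * cost_per_call_ms / 1000  # ~8.8 s of overhead
```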

The lexer is now indeed very quick – I’d say I achieved my goal. The difference compared to the previous V3 lexer (which was also quite fast for “normal” code, i.e. code somebody would actually write, as opposed to artificially created and absurdly large test files) is notable: the new lexer outperforms the previous one by up to 6x! The performance was measured with 5 artificially generated, relatively large test files as follows:

File size: 4.5 MB
Physical code lines: 11,000
Average line length: 1,100 characters
Consists of keywords, identifiers (6–10 characters each), integral and decimal numbers (5–7 characters each), parentheses, and a wide range of different operators
1.6 million tokens
File read + full lexical analysis: 0.36 seconds on average

File size: 12.2 MB
Physical code lines: 11,000
Line length: 1,137 characters
Consists of identifiers (7 characters each)
1.6 million tokens
File read + full lexical analysis: 1.54 seconds on average

File size: 12.5 MB
Physical code lines: 11,000
Line length: 1,163 characters
Consists of integer numbers alone (6 digits each)
1.8 million tokens
File read + full lexical analysis: 0.31 seconds on average

File size: 14.3 MB
Physical code lines: 11,000
Line length: 1,329 characters
Consists of double decimal numbers alone (6 digits + decimal separator each)
1.8 million tokens
File read + full lexical analysis: 0.39 seconds on average

File size: 12.1 MB
Physical code lines: 11,000
Average line length: 1,150 characters
Consists of random language keywords (83 different kinds)
1.4 million tokens
File read + full lexical analysis: 0.87 seconds on average

Even though most of the processing is done in memory, the lexer doesn’t read these multi-megabyte files completely into memory. This keeps memory pressure within manageable bounds.
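Buffered reading of this kind can be sketched as follows. The chunk size here is an assumption for illustration; the actual FileReader’s buffering strategy and buffer sizes aren’t documented in this post.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical buffer size, not the real FileReader's

def read_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield the input in fixed-size chunks so the whole file is never held in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Example: a 150 KB in-memory "file" arrives as 64 KB + 64 KB + 22 KB chunks.
data = io.BytesIO(b"x" * (150 * 1024))
sizes = [len(c) for c in read_chunks(data)]
```

The lexer can consume each chunk as it arrives, so peak memory stays proportional to the buffer size rather than the file size.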

And in the end…
It’s now possible to use the underscore as a line continuation character, so the following code is considered one logical line:

Function MyLongFunc( _
    parameter1 As Integer, _
    parameter2 As Integer, _
    parameter3 As Integer, _
    parameter4 As Integer, _
    parameter5 As Integer _
)

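Joining physical lines into one logical line can be sketched like this. I’m assuming a trailing underscore on a physical line marks continuation, as described above; how the real lexer handles edge cases (underscores inside strings, identifiers ending in underscore) isn’t covered here.

```python
# Sketch of logical-line joining: a trailing "_" continues the line.
# Simplified -- ignores strings, comments, and other edge cases.
def join_logical_lines(physical_lines):
    logical, buffer = [], ""
    for line in physical_lines:
        stripped = line.rstrip()
        if stripped.endswith("_"):
            buffer += stripped[:-1].rstrip() + " "   # drop the "_" and keep going
        else:
            logical.append(buffer + stripped)
            buffer = ""
    if buffer:                                       # unterminated trailing continuation
        logical.append(buffer.rstrip())
    return logical

merged = join_logical_lines(["Function F( _", "  x As Integer _", ")"])
```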