{"id":218,"date":"2009-08-14T14:12:11","date_gmt":"2009-08-14T12:12:11","guid":{"rendered":"http:\/\/www.coolbasic.com\/blog\/?p=218"},"modified":"2009-08-14T14:12:11","modified_gmt":"2009-08-14T12:12:11","slug":"introducing-the-new-lexer","status":"publish","type":"post","link":"https:\/\/www.coolbasic.com\/blog\/2009\/08\/14\/introducing-the-new-lexer\/","title":{"rendered":"Introducing the new Lexer"},"content":{"rendered":"<p>Last time I showed you some test results of the performance of CoolBasic V3 compiler. The numbers weren\u2019t too appealing when the compiler was fed a huge special kind of a test file. The structure of that certain file caused some compiler bottlenecks to amplify, which resulted in slower compile time than expected. All this got me into thinking of more optimized ways to apply lexical analysis, parsing, analyzing and code transformation. In my previous blogging I introduced some visions of mine in how to fix those issues, and the modifications to the lexer are now in place.<\/p>\n<p>The lexer is executed in the beginning of the compilation process. It\u2019s responsible for identifying tokens from the file stream and then saving the tokens to a linked list for later processing. This new implementation also uses the new fast FileReader, and it features full widechar conversion and Unicode support.  The lexer also now utilizes the new GlobalMemoryCache system and super-fast linked list system. In addition, the compiler now runs in both 32-bit and 64-bit mode. So, lots of new technologies have already been implemented, and the base to build the parser and analyzer upon, is eventually materializing. It wasn\u2019t easy to come through, though.<\/p>\n<p>My first idea was to use hand-written Assembly code in order to jump between different lexer states without actually needing to test any conditions for it. Basically the FileReader just called asm <code>jmp<\/code> to certain label address that was dynamically stored in a variable. It was faster than a procedure call, and certainly faster than a procedure call + set of conditions to test the lexer state against, but I found out the hard way that this wasn\u2019t the most secure way of controlling the program flow: I experienced some odd side-effects I couldn\u2019t explain what caused them, so I eventually threw the Assembly code out of the window, and did it in the \u201cright\u201d \u2122 way. In addition, in 64-bit mode you\u2019d have to use the <code>qword<\/code> modifier in order to perform the asm <code>jmp<\/code> anyway, which for some reason, is slower than in 32-bit mode \u2013 in fact it would\u2019ve been a bit slower than a standard procedure call.<\/p>\n<p>During my \u201chappy\u201d Assembly experimentation I experienced very slow compilation (probably lost some hair and\/or beard because of it). First I thought it was because of my bad asm-cookings, but the problem persisted even after I removed all hand-written Assembly from the source. After lots of digging and peeking, I finally found the cause: It was just one little procedure that gets called only once per line, but which took 0.8 milliseconds to complete every time. For big test files (11,000 lines+) this adds up. It took me countless of hours and lots of cursing, but when this little issue had been identified and properly handled, the lexer had finally reborn.<\/p>\n<p>The lexer is now indeed very quick \u2013 I\u2019d say I achieved my goal. The difference compared to the previous V3 lexer (which also was quite fast for \u201cnormal\u201d code i.e. a code somebody would actually write and not artificially created and absurdly large test files) is notable: The new lexer outperforms the previous V3 lexer by up to 6X! The performance was measured with 5 artificially generated, and relatively large, test files as follows:<\/p>\n<blockquote><p>File size: 4.5Mb<br \/>\nPhysical code lines: 11,000<br \/>\nAverage line length: 1,100 characters<br \/>\nConsists of keywords, identfiers (6-10 charactes each), integral and decimal numbers (5-7 characters each), parentheses and wide range of different operators<br \/>\n1.6 million tokens<br \/>\nFile read + full lexical analysis: <strong>0.36<\/strong> seconds on average<\/p>\n<p>File size: 12.2Mb<br \/>\nPhysical code lines: 11,000<br \/>\nLine length: 1,137 characters<br \/>\nConsists of identfiers (7 charactes each)<br \/>\n1.6 million tokens<br \/>\nFile read + full lexical analysis: <strong>1.54<\/strong> seconds on average<\/p>\n<p>File size: 12.5Mb<br \/>\nPhysical code lines: 11,000<br \/>\nLine length: 1,163 characters<br \/>\nConsists of integer numbers alone (6 digits each)<br \/>\n1.8 million tokens<br \/>\nFile read + full lexical analysis: <strong>0.31<\/strong> seconds on average<\/p>\n<p>File size: 14.3Mb<br \/>\nPhysical code lines: 11,000<br \/>\nLine length: 1,329 characters<br \/>\nConsists of double decimal numbers alone (6 digits + decimal separator each)<br \/>\n1.8 million tokens<br \/>\nFile read + full lexical analysis: <strong>0.39<\/strong> seconds on average<\/p>\n<p>File size: 12.1Mb<br \/>\nPhysical code lines: 11,000<br \/>\nAverage line length: 1,150 characters<br \/>\nConsists of random language keywords (83 different kinds of)<br \/>\n1.4 million tokens<br \/>\nFile read + full lexical analysis: <strong>0.87<\/strong> seconds on average<\/p><\/blockquote>\n<p>Even though most processing is done in memory, the lexer doesn\u2019t read several megabytes long files completely into memory. This keeps memory pressure in manageable boundaries.<\/p>\n<p><strong>And in the end\u2026<\/strong><br \/>\nIt\u2019s now possible to use the underscore as a line continuation character, thus the following code is considered to be one logical line:<\/p>\n<pre name=\"code\" class=\"cbv3\">Function MyLongFunc( _\n    parameter1 As Integer, _\n    parameter2 As Integer, _\n    parameter3 As Integer, _\n    parameter4 As Integer, _\n    parameter5 As Integer _\n)<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Last time I showed you some test results of the performance of CoolBasic V3 compiler. The numbers weren\u2019t too appealing when the compiler was fed a huge special kind of a test file. The structure of that certain file caused some compiler bottlenecks to amplify, which resulted in slower compile time than expected. All this [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[5,8],"_links":{"self":[{"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/posts\/218"}],"collection":[{"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/comments?post=218"}],"version-history":[{"count":0,"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/posts\/218\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/media?parent=218"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/categories?post=218"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.coolbasic.com\/blog\/wp-json\/wp\/v2\/tags?post=218"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}