[Lexer] Add Unicode identifier and whitespace recognition #23
base: dil-main
@labath @cmtice |
Did some benchmarking. Using |
I'm going to be very unsympathetic to any arguments about performance until I see data showing that the lexer takes up an appreciable portion of the time it takes to evaluate a DIL expression. Since most DIL expressions are going to be less than ~20 characters long, I find it very hard to imagine an implementation that would be too slow.
That said, the reason I suggested this function is that I thought you wouldn't be doing any Unicode conversions in the lexer (when I said I wanted to treat all Unicode chars as identifiers, I really meant all of them, Ogham Space Marks (U+1680) included). If you're already doing Unicode conversions (*), then counting those is good enough for me, as the main thing I'm optimising for here is the complexity of the implementation.
(*) There are far fewer space characters than potential identifier chars, and I think they're a lot less ambiguous, so if you really think they are needed (I don't), I'd be fine with that. That said, given that there are so few of them, and in the aforementioned interest of reducing the amount of code written, I think it would be easier to skip those via something like:
(i.e., let the compiler convert these into byte sequences and then use StringRef operations for the rest). |
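A minimal sketch of that approach, for illustration only. It uses std::string_view as a stand-in for StringRef so that it compiles without LLVM; the function name SkipLeadingWhitespace and the particular set of whitespace characters are assumptions, not taken from the patch. The "\u..." escapes rely on the compiler emitting UTF-8 for ordinary string literals (the default for GCC and Clang).

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not from the patch): drops leading whitespace from
// `input`, where "whitespace" is a short, hand-picked list of ASCII and
// Unicode characters. The compiler turns each "\u..." escape into its UTF-8
// byte sequence, so only plain byte-wise prefix comparisons are needed.
// Assumption: the execution character set is UTF-8.
inline std::string_view SkipLeadingWhitespace(std::string_view input) {
  static constexpr std::array<std::string_view, 6> kSpaces = {
      " ",       // U+0020 SPACE
      "\t",      // U+0009 CHARACTER TABULATION
      "\n",      // U+000A LINE FEED
      "\r",      // U+000D CARRIAGE RETURN
      "\u00A0",  // NO-BREAK SPACE
      "\u1680",  // OGHAM SPACE MARK
  };
  bool progressed = true;
  while (progressed) {
    progressed = false;
    for (std::string_view space : kSpaces) {
      // Byte-wise prefix test; substr(0, n) clamps n to the string length.
      if (input.substr(0, space.size()) == space) {
        input.remove_prefix(space.size());
        progressed = true;
        break;
      }
    }
  }
  return input;
}
```

With StringRef the loop body would presumably shrink to a single prefix-consuming call per candidate, but that variant is not shown here because it would need the LLVM headers.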
Removing the non-standard whitespace characters makes sense; I guess even Unicode identifiers are usually separated by regular whitespace anyway. |
Added skipping of both Unicode and ASCII whitespace at the beginning.
Replaced the IsWord code with Unicode identifier recognition; the output is then checked against the keywords as before. Codepoints for Unicode whitespace and identifiers are taken from the Swift lexer. The length of an identifier is counted in Unicode characters and added to the position tracker.
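As a rough illustration of the position tracking described above, the sketch below measures an identifier in bytes and then converts that to a count of Unicode characters. Both function names are hypothetical, and the identifier rule is a deliberate simplification (ASCII letters, digits, underscore, plus any non-ASCII byte) rather than the Swift-derived codepoint ranges the patch uses. It assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of Unicode code points in valid UTF-8 text.
// Every code point has exactly one byte that is not a continuation byte
// (continuation bytes look like 10xxxxxx), so counting those is enough.
inline std::size_t CountCodePoints(std::string_view text) {
  std::size_t count = 0;
  for (unsigned char byte : text)
    if ((byte & 0xC0) != 0x80)
      ++count;
  return count;
}

// Hypothetical helper: byte length of the identifier at the start of `input`,
// or 0 if there is none. Simplified rule (an assumption, not the patch's
// tables): any non-ASCII byte, ASCII letter, or '_' may appear anywhere;
// ASCII digits may appear anywhere except the first position.
inline std::size_t IdentifierByteLength(std::string_view input) {
  auto is_start = [](unsigned char c) {
    return c >= 0x80 || c == '_' || (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z');
  };
  auto is_digit = [](unsigned char c) { return c >= '0' && c <= '9'; };

  if (input.empty() || !is_start(static_cast<unsigned char>(input[0])))
    return 0;
  std::size_t length = 1;
  while (length < input.size()) {
    unsigned char c = static_cast<unsigned char>(input[length]);
    if (!is_start(c) && !is_digit(c))
      break;
    ++length;
  }
  return length;
}
```

A lexer built on these would consume IdentifierByteLength(input) bytes from the input but advance its column counter by CountCodePoints of that prefix, so that the reported positions line up with what the user sees rather than with byte offsets.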