
[Lexer] Add Unicode identifier and whitespace recognition #23


Open · wants to merge 9 commits into base: dil-main

Conversation

kuilpd (Collaborator) commented Feb 25, 2025

Added skipping of both Unicode and ASCII whitespace at the start of lexing.
Replaced the IsWord code with Unicode identifier recognition; the output is then checked for being a keyword, as before. The code points for Unicode whitespace and identifiers are taken from the Swift lexer.
The length of an identifier is counted in Unicode characters and added to the position tracker.

kuilpd requested review from asl and cmtice on February 25, 2025 at 15:21
kuilpd (Collaborator, Author) commented Mar 3, 2025

@labath @cmtice
Removed all Unicode character checks for identifiers and added counting the token length using llvm::sys::unicode::columnWidthUTF8.
I don't like, though, that this function does the UTF conversion and character-length counting again, since that work is already done during identifier checking. I could have used the function charWidth there, but unfortunately it's static inline.

kuilpd (Collaborator, Author) commented Mar 7, 2025

Did some benchmarking: using llvm::sys::unicode::columnWidthUTF8 slows down the entire lexer by ~35%. After moving llvm::sys::unicode::charWidth to the header and reusing it, the slowdown is about ~25%, a bit better. Sticking with this solution for now, but maybe we should reconsider whether we need this width counting at all.

labath commented Mar 10, 2025

I'm going to be very unsympathetic to any arguments about performance until I see some data that shows that the lexer takes up an appreciable portion of the time it takes to evaluate a DIL expression. Since most of the DIL expressions are going to be less than ~20 characters long, I find it very hard to imagine an implementation that would be too slow.

That said, the reason I suggested this function is that I thought you wouldn't be doing any Unicode conversions in the lexer (when I said I wanted to treat all Unicode chars as identifiers, I really meant all of them, Ogham Space Marks (U+1680) included). If you're already doing Unicode conversions (*), then counting those is good enough for me, as the main thing I'm optimizing for here is the complexity of the implementation.

(*) There are a lot fewer space characters than there are potential identifier chars, and I think they're a lot less ambiguous, so if you really think they are needed (I don't), I'd be fine with that. That said, given that there are so few of them, and in the aforementioned interest of reducing the amount of code written, I think it would be easier to skip them via something like:

StringRef SkipSpaces(StringRef text) {
  while (true) {
    StringRef cur = text.ltrim();
    cur.consume_front("\u0085"); // Next Line (Nel)
    cur.consume_front("\u00a0"); // No-Break Space
    cur.consume_front("\u1680"); // Ogham Space Mark
    // ...
    if (text.data() == cur.data()) return text; // no spaces consumed
    text = cur;
  }
}

(i.e., let the compiler convert these into byte sequences and then use StringRef operations for the rest).

kuilpd (Collaborator, Author) commented Mar 11, 2025

Removing the non-standard whitespaces makes sense; I guess even Unicode identifiers are usually separated by a regular whitespace anyway.
And since Unicode code points are no longer checked, I replaced the conversion with isLegalUTF8Sequence.
