
[Lexer] Add Unicode identifier and whitespace recognition #23


Open · wants to merge 9 commits into base: dil-main

Conversation

kuilpd (Collaborator) commented Feb 25, 2025

Added skipping of both Unicode and ASCII whitespace at the start of lexing.
Replaced the IsWord code with Unicode identifier recognition; the output is then checked for being a keyword, as before. The code points for Unicode whitespace and identifiers are taken from the Swift lexer.
The length of an identifier is counted in Unicode characters and added to the position tracker.

kuilpd requested review from asl and cmtice on February 25, 2025 at 15:21
kuilpd (Collaborator, Author) commented Mar 3, 2025

@labath @cmtice
Removed all Unicode character checks for identifiers and added counting the token length using llvm::sys::unicode::columnWidthUTF8.
I don't like, though, that this function does the UTF conversion and character-length counting again, since that work is already done during identifier checking. I could have used the function charWidth there, but unfortunately it's static inline.

kuilpd (Collaborator, Author) commented Mar 7, 2025

Did some benchmarking: using llvm::sys::unicode::columnWidthUTF8 slows down the entire lexer by ~35%. After moving llvm::sys::unicode::charWidth to the header and reusing it, the slowdown is about ~25%, a bit better. Sticking with this solution for now, but maybe we should reconsider whether we need this width counting at all.

labath commented Mar 10, 2025

I'm going to be very unsympathetic to any arguments about performance until I see some data that shows that the lexer takes up an appreciable portion of the time it takes to evaluate a DIL expression. Since most of the DIL expressions are going to be less than ~20 characters long, I find it very hard to imagine an implementation that would be too slow.

That said, the reason I suggested this function is that I thought you wouldn't be doing any Unicode conversions in the lexer (when I said I wanted to treat all Unicode chars as identifiers, I really meant all of them, Ogham Space Marks (U+1680) included). If you're already doing Unicode conversions (*), then counting those is good enough for me, as the main thing I'm optimizing for here is the complexity of the implementation.

(*) There are a lot fewer space characters than there are potential identifier chars, and I think they're a lot less ambiguous, so if you really think they are needed (I don't), I'd be fine with that. That said, given that there are so few of them, and in the aforementioned interest of reducing the amount of code written, I think it would be easier to skip them via something like:

StringRef SkipSpaces(StringRef text) {
  while (true) {
    StringRef cur = text.ltrim();
    cur.consume_front("\u0085"); // Next Line (Nel)
    cur.consume_front("\u00a0"); // No-Break Space
    cur.consume_front("\u1680"); // Ogham Space Mark
    // ...
    if (text.data() == cur.data()) return text; // no spaces consumed
    text = cur;
  }
}

(i.e., let the compiler convert these into byte sequences and then use StringRef operations for the rest).

kuilpd (Collaborator, Author) commented Mar 11, 2025

Removing the non-standard whitespaces makes sense; I guess even Unicode identifiers are usually separated by a regular whitespace anyway.
And since Unicode code points are no longer checked, I replaced the conversion with isLegalUTF8Sequence.
