internal: improve TokenSet implementation and add reserved keywords #17037


Merged · 4 commits · Apr 16, 2024

Conversation

@davidsemakula (Contributor) commented Apr 9, 2024

The current TokenSet type represents "A bit-set of SyntaxKinds" as a newtype u128.
Internally, the flag for each SyntaxKind variant in the bit-set is set as the n-th LSB (least significant bit) via a bit-wise left shift operation, where n is the discriminant.

Edit: This is problematic because there are currently ~121 token SyntaxKinds (266 SyntaxKinds in total, counting nodes), so adding new token kinds for the missing reserved keywords pushes the number of token SyntaxKinds above 128, making this "mask" operation overflow; the overflow is silent in release mode.
This leads to a single flag/bit in the bit-set being shared by multiple SyntaxKinds.
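The failure mode described above can be sketched as follows. This is a hypothetical, simplified rendering of the u128-backed design (names and the raw-u16 kind representation are assumptions, not the actual rust-analyzer code):

```rust
// Hypothetical minimal sketch of a u128-backed TokenSet; the flag for
// each kind is the n-th bit, where n is the kind's discriminant.
#[derive(Clone, Copy)]
struct TokenSet(u128);

impl TokenSet {
    const fn new(kinds: &[u16]) -> TokenSet {
        let mut res = 0u128;
        let mut i = 0;
        while i < kinds.len() {
            res |= mask(kinds[i]);
            i += 1;
        }
        TokenSet(res)
    }

    const fn contains(&self, kind: u16) -> bool {
        self.0 & mask(kind) != 0
    }
}

const fn mask(kind: u16) -> u128 {
    // For kind >= 128 this shift overflows: it panics in debug builds
    // and wraps the shift amount in release builds, so e.g. kinds 2
    // and 130 would end up sharing bit 2.
    1u128 << (kind as usize)
}

fn main() {
    let set = TokenSet::new(&[3, 7]);
    assert!(set.contains(3));
    assert!(set.contains(7));
    assert!(!set.contains(4));
    println!("ok");
}
```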

This PR:

  • Changes the wrapped type for TokenSet from u128 to [u64; 3] (an earlier revision of this PR used [u*; N], at the time [u16; 17], where u* can be any desirable unsigned integer type and N is the minimum array length needed to represent all token SyntaxKinds without any collisions).
  • Edit: Adds an assertion that TokenSets only include token SyntaxKinds
  • Edit: Adds ~7 missing reserved keywords
  • Moves the definition of the TokenSet type to grammar codegen in xtask, so that N is adjusted automatically (depending on the chosen u* "base" type) when new SyntaxKinds are added.
  • Updates the token_set_works_for_tokens unit test to include the __LAST SyntaxKind as a way of catching overflows in tests.

Currently, u16 is arbitrarily chosen as the u* "base" type, mostly because it strikes a good balance (IMO) between unused bits and readability of the generated TokenSet code (especially the union method), but I'm open to other suggestions or a better methodology for choosing the u* type.

I considered using a third-party crate for the bit-set, but a direct implementation seems simple enough without adding any new dependencies. I'm not strongly opposed to using a third-party crate though, if that's preferred.

Finally, I haven't had the chance to review open issues to figure out whether any parser problems caused by collisions in the current implementation might be fixed by this PR; I just stumbled upon the issue while adding "new" keywords to solve #16858.

Edit: fixes #16858
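The multi-word bit-set this PR moves to could look something like the sketch below. This is a hedged illustration, not the PR's actual code: LAST_TOKEN's value is an assumption, and the real implementation fixes N at 3 rather than deriving it:

```rust
// Hypothetical sketch of a [u64; N] bit-set, with N derived from the
// highest token-kind discriminant (LAST_TOKEN's value is assumed).
const LAST_TOKEN: u16 = 135;
const N: usize = LAST_TOKEN as usize / 64 + 1; // minimum words needed

#[derive(Clone, Copy)]
struct TokenSet([u64; N]);

impl TokenSet {
    const fn new(kinds: &[u16]) -> TokenSet {
        let mut res = [0u64; N];
        let mut i = 0;
        while i < kinds.len() {
            let kind = kinds[i] as usize;
            // kind / 64 selects the word, kind % 64 the bit, so no two
            // distinct kinds below 64 * N can collide.
            res[kind / 64] |= 1 << (kind % 64);
            i += 1;
        }
        TokenSet(res)
    }

    const fn union(self, other: TokenSet) -> TokenSet {
        let mut res = [0u64; N];
        let mut i = 0;
        while i < N {
            res[i] = self.0[i] | other.0[i];
            i += 1;
        }
        TokenSet(res)
    }

    const fn contains(&self, kind: u16) -> bool {
        self.0[kind as usize / 64] & (1 << (kind as usize % 64)) != 0
    }
}

fn main() {
    let a = TokenSet::new(&[1, 70]);
    let b = TokenSet::new(&[130]);
    let u = a.union(b);
    assert!(u.contains(1) && u.contains(70) && u.contains(130));
    assert!(!u.contains(2));
    println!("ok");
}
```

Keeping the methods const fn preserves the ability to build sets in const items, which is why the loops are written with while rather than iterators.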

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 9, 2024
@lnicola (Member) commented Apr 10, 2024

How come this never fails in practice? Do we never construct a TokenSet with something higher than RET_TYPE?

@Veykril (Member) commented Apr 10, 2024

TokenSets only ever contain token syntax kinds, which populate the low integer variants only, so in practice things never collide here. So while this is a nice idea, we don't really need to do this (it should be fine to just debug-assert that we only ever pass a token syntax kind to these).

I have a private language project (based on rust-lang/rust-analyzer and rome/tools infra) where I actually split the syntax kinds into token and node ones (and a bunch of other invariant stuff, including generating the first and follow sets) that I might upstream at some point if I find the time.

@lnicola (Member) commented Apr 10, 2024

I mean, we're cutting it pretty close with LIFETIME_IDENT (119), but I can't find any higher than that. Of course, T![] doesn't make it any easier to check. But we could at least add an assertion for it.

@Veykril (Member) commented Apr 10, 2024

We should certainly have a (debug_)assert here
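The assert being discussed could look roughly like this; LAST_TOKEN (its name and value) and the raw-u16 kind representation are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical guard: refuse non-token SyntaxKinds before masking.
// LAST_TOKEN's value (121) is assumed here for illustration.
const LAST_TOKEN: u16 = 121;

fn mask(kind: u16) -> u128 {
    debug_assert!(
        kind <= LAST_TOKEN,
        "non-token SyntaxKind {kind} passed to TokenSet"
    );
    1u128 << (kind as usize)
}

fn main() {
    assert_eq!(mask(0), 1);
    assert_eq!(mask(3), 8);
    println!("ok");
}
```

A debug_assert keeps the check out of release builds, matching the "(debug_)assert" suggestion above.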

@lnicola (Member) commented Apr 10, 2024

But can we fit them all in 0 to 63 though? 😅.

(No, I don't expect it to make any difference)

@Veykril (Member) commented Apr 10, 2024

I mean, we are using a u128 right now, that suffices, no?

@lnicola (Member) commented Apr 10, 2024

Yeah, but if we could fit them in 64 bits, we'd save 8 full bytes of memory for a token set!
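For reference, the sizes behind this exchange, using nothing but std::mem::size_of (the 8-byte saving is u128 versus a single u64):

```rust
// Sizes of the candidate backing types discussed above.
fn main() {
    assert_eq!(std::mem::size_of::<u128>(), 16);
    assert_eq!(std::mem::size_of::<u64>(), 8); // 8 bytes saved vs u128
    assert_eq!(std::mem::size_of::<[u64; 2]>(), 16); // same as u128
    assert_eq!(std::mem::size_of::<[u64; 3]>(), 24);
    println!("ok");
}
```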

@davidsemakula (Contributor, Author) commented Apr 10, 2024

TokenSets only ever contain token syntax kinds which populate the low integer variants only, so in practice things never collide here.

Gotcha, that explains why it hasn't been problematic before 🙂

So while this is a nice idea here, we don't really need to do this (should be fine to just debug assert that we only ever pass a token syntax kind to these)

As mentioned earlier, I ran into the issue while solving #16858

I attempted to add ~7 reserved keywords (e.g. abstract, final, etc.), which triggered the overflow(s).

I could solve the issue another way, but I thought adding the reserved keywords here was the best solution.

@Veykril (Member) commented Apr 10, 2024

I see, adding the keywords as kinds seems like the proper approach here. Can we change the token set to just contain a [u64; 3] then? I don't think splitting it into smaller primitives makes much sense, and it's fine to "waste" some bytes here; we don't store these anywhere really, and if they become too big, the compiler will probably pass them by reference anyway.

@davidsemakula davidsemakula changed the title fix: TokenSet collisions internal: improve TokenSet implementation Apr 10, 2024
@davidsemakula (Contributor, Author) commented Apr 10, 2024

I've updated the implementation to use [u64; N] (currently [u64; 2]; it will grow to [u64; 3] when more tokens are added) and added an assertion that kind is not higher than the discriminant of the "last" token SyntaxKind.

I'll be pushing the fix for #16858 (i.e. adding missing reserved keywords and related auto-import tests) in a separate PR after this one is merged.

cc @Veykril @lnicola

@Veykril (Member) commented Apr 14, 2024

Codegenerating this seems a bit overkill; let's just keep the manual impl. I expect us to only ever need to bump the backing limit once (when you add the new contextual keywords).

@davidsemakula davidsemakula changed the title internal: improve TokenSet implementation internal: improve TokenSet implementation and add reserved keywords Apr 14, 2024
@davidsemakula (Contributor, Author) commented

Done, also added fix for #16858

@davidsemakula davidsemakula requested a review from lnicola April 15, 2024 14:31
@Veykril (Member) commented Apr 16, 2024

Thanks!
@bors r+

@bors (Contributor) commented Apr 16, 2024

📌 Commit 69fe457 has been approved by Veykril

It is now in the queue for this repository.

@bors (Contributor) commented Apr 16, 2024

⌛ Testing commit 69fe457 with merge e64610d...

@bors (Contributor) commented Apr 16, 2024

☀️ Test successful - checks-actions
Approved by: Veykril
Pushing e64610d to master...

@bors bors merged commit e64610d into rust-lang:master Apr 16, 2024
@davidsemakula davidsemakula deleted the token-set-collisions branch April 16, 2024 08:03
Successfully merging this pull request may close these issues.

Reserved words in paths are not prepended with 'r#'
5 participants