internal: improve TokenSet implementation and add reserved keywords #17037


Merged · 4 commits · Apr 16, 2024

Conversation

@davidsemakula (Contributor) commented Apr 9, 2024

The current TokenSet type represents "A bit-set of SyntaxKinds" as a newtype u128.
Internally, the flag for each SyntaxKind variant in the bit-set is set as the n-th LSB (least significant bit) via a bit-wise left shift operation, where n is the discriminant.

Edit: This is problematic because there are currently ~121 token SyntaxKinds (266 SyntaxKinds in total, counting nodes), so adding new token kinds for the missing reserved keywords pushes the number of token SyntaxKinds above 128, making this "mask" operation overflow; the overflow is silent in release mode.
This leads to a single flag/bit in the bit-set being shared by multiple SyntaxKinds.
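The failure mode described above can be sketched as follows. This is a hypothetical, simplified rendering of the u128-backed design (names and the raw-u16 kind representation are assumptions, not the actual rust-analyzer code):

```rust
// Hypothetical minimal sketch of a u128-backed TokenSet; the flag for
// each kind is the n-th bit, where n is the kind's discriminant.
#[derive(Clone, Copy)]
struct TokenSet(u128);

impl TokenSet {
    const fn new(kinds: &[u16]) -> TokenSet {
        let mut res = 0u128;
        let mut i = 0;
        while i < kinds.len() {
            res |= mask(kinds[i]);
            i += 1;
        }
        TokenSet(res)
    }

    const fn contains(&self, kind: u16) -> bool {
        self.0 & mask(kind) != 0
    }
}

const fn mask(kind: u16) -> u128 {
    // For kind >= 128 this shift overflows: it panics in debug builds
    // and wraps the shift amount in release builds, so e.g. kinds 2
    // and 130 would end up sharing bit 2.
    1u128 << (kind as usize)
}

fn main() {
    let set = TokenSet::new(&[3, 7]);
    assert!(set.contains(3));
    assert!(set.contains(7));
    assert!(!set.contains(4));
    println!("ok");
}
```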

This PR:

  • Changes the wrapped type for TokenSet from u128 to [u64; 3] (an earlier revision of this PR used [u*; N], at the time [u16; 17], where u* can be any desirable unsigned integer type and N is the minimum array length needed to represent all token SyntaxKinds without any collisions).
  • Edit: Adds an assertion that TokenSets only include token SyntaxKinds
  • Edit: Adds ~7 missing reserved keywords
  • Moves the definition of the TokenSet type to grammar codegen in xtask, so that N is adjusted automatically (depending on the chosen u* "base" type) when new SyntaxKinds are added.
  • Updates the token_set_works_for_tokens unit test to include the __LAST SyntaxKind as a way of catching overflows in tests.

Currently, u16 is arbitrarily chosen as the u* "base" type, mostly because it strikes a good balance (IMO) between unused bits and readability of the generated TokenSet code (especially the union method), but I'm open to other suggestions or a better methodology for choosing the u* type.

I considered using a third-party crate for the bit-set, but a direct implementation seems simple enough without adding any new dependencies. I'm not strongly opposed to using a third-party crate though, if that's preferred.

Finally, I haven't had the chance to review open issues to figure out whether any parser problems caused by collisions in the current implementation might be fixed by this PR; I just stumbled upon the issue while adding "new" keywords to solve #16858.

Edit: fixes #16858
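The multi-word bit-set this PR moves to could look something like the sketch below. This is a hedged illustration, not the PR's actual code: LAST_TOKEN's value is an assumption, and the real implementation fixes N at 3 rather than deriving it:

```rust
// Hypothetical sketch of a [u64; N] bit-set, with N derived from the
// highest token-kind discriminant (LAST_TOKEN's value is assumed).
const LAST_TOKEN: u16 = 135;
const N: usize = LAST_TOKEN as usize / 64 + 1; // minimum words needed

#[derive(Clone, Copy)]
struct TokenSet([u64; N]);

impl TokenSet {
    const fn new(kinds: &[u16]) -> TokenSet {
        let mut res = [0u64; N];
        let mut i = 0;
        while i < kinds.len() {
            let kind = kinds[i] as usize;
            // kind / 64 selects the word, kind % 64 the bit, so no two
            // distinct kinds below 64 * N can collide.
            res[kind / 64] |= 1 << (kind % 64);
            i += 1;
        }
        TokenSet(res)
    }

    const fn union(self, other: TokenSet) -> TokenSet {
        let mut res = [0u64; N];
        let mut i = 0;
        while i < N {
            res[i] = self.0[i] | other.0[i];
            i += 1;
        }
        TokenSet(res)
    }

    const fn contains(&self, kind: u16) -> bool {
        self.0[kind as usize / 64] & (1 << (kind as usize % 64)) != 0
    }
}

fn main() {
    let a = TokenSet::new(&[1, 70]);
    let b = TokenSet::new(&[130]);
    let u = a.union(b);
    assert!(u.contains(1) && u.contains(70) && u.contains(130));
    assert!(!u.contains(2));
    println!("ok");
}
```

Keeping the methods const fn preserves the ability to build sets in const items, which is why the loops are written with while rather than iterators.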

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 9, 2024
@lnicola (Member) commented Apr 10, 2024

How come this never fails in practice? Do we never construct a TokenSet with something higher than RET_TYPE?

@Veykril (Member) commented Apr 10, 2024

TokenSets only ever contain token syntax kinds, which populate the low integer variants only, so in practice things never collide here. So while this is a nice idea, we don't really need to do this (it should be fine to just debug-assert that we only ever pass a token syntax kind to these).

I have a private language project (based on rust-lang/rust-analyzer and rome/tools infra) where I actually split the syntax kinds into token and node ones (and a bunch of other invariant stuff, including generating the first and follow sets) that I might upstream at some point if I find the time.

@lnicola (Member) commented Apr 10, 2024

I mean, we're cutting it pretty close with LIFETIME_IDENT (119), but I can't find any higher than that. Of course, T![] doesn't make it any easier to check. But we could at least add an assertion for it.

@Veykril (Member) commented Apr 10, 2024

We should certainly have a (debug_)assert here
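The assert being discussed could look roughly like this; LAST_TOKEN (its name and value) and the raw-u16 kind representation are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical guard: refuse non-token SyntaxKinds before masking.
// LAST_TOKEN's value (121) is assumed here for illustration.
const LAST_TOKEN: u16 = 121;

fn mask(kind: u16) -> u128 {
    debug_assert!(
        kind <= LAST_TOKEN,
        "non-token SyntaxKind {kind} passed to TokenSet"
    );
    1u128 << (kind as usize)
}

fn main() {
    assert_eq!(mask(0), 1);
    assert_eq!(mask(3), 8);
    println!("ok");
}
```

A debug_assert keeps the check out of release builds, matching the "(debug_)assert" suggestion above.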

@lnicola (Member) commented Apr 10, 2024

But can we fit them all in 0 to 63 though? 😅.

(No, I don't expect it to make any difference)

@Veykril (Member) commented Apr 10, 2024

I mean, we are using a u128 right now, that suffices, no?

@lnicola (Member) commented Apr 10, 2024

Yeah, but if we could fit them in 64 bits, we'd save 8 full bytes of memory for a token set!
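For reference, the sizes behind this exchange, using nothing but std::mem::size_of (the 8-byte saving is u128 versus a single u64):

```rust
// Sizes of the candidate backing types discussed above.
fn main() {
    assert_eq!(std::mem::size_of::<u128>(), 16);
    assert_eq!(std::mem::size_of::<u64>(), 8); // 8 bytes saved vs u128
    assert_eq!(std::mem::size_of::<[u64; 2]>(), 16); // same as u128
    assert_eq!(std::mem::size_of::<[u64; 3]>(), 24);
    println!("ok");
}
```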

@davidsemakula (Contributor, Author) commented Apr 10, 2024

TokenSets only ever contain token syntax kinds which populate the low integer variants only, so in practice things never collide here.

Gotcha, that explains why it hasn't been problematic before 🙂

So while this is a nice idea here, we don't really need to do this (should be fine to just debug assert that we only ever pass a token syntax kind to these)

As mentioned earlier, I ran into the issue while solving #16858

I attempted to add ~7 reserved keywords (e.g. abstract, final, etc.), which triggered the overflow(s).

I could solve the issue another way, but I thought adding the reserved keywords here was the best solution.

@Veykril (Member) commented Apr 10, 2024

I see, adding the keywords as kinds seems like the proper approach here. Can we change the token set to just contain a [u64; 3] then? I don't think splitting it into smaller primitives makes much sense, and it's fine to "waste" some bytes here; we don't store these anywhere really, and if they become too big, the compiler will probably pass them by reference anyway.

@davidsemakula davidsemakula changed the title fix: TokenSet collisions internal: improve TokenSet implementation Apr 10, 2024
@davidsemakula (Contributor, Author) commented Apr 10, 2024

I've updated the implementation to use [u64; N] (currently [u64; 2]; it will grow to [u64; 3] when more tokens are added) and added an assertion that kind is not higher than the discriminant of the "last" token SyntaxKind.

I'll be pushing the fix for #16858 (i.e. adding missing reserved keywords and related auto-import tests) in a separate PR after this one is merged.

cc @Veykril @lnicola

@Veykril (Member) commented Apr 14, 2024

Codegenerating this seems a bit overkill; let's just keep the manual impl. I expect us to only ever need to bump the backing limit once (when you add the new contextual keywords).

@davidsemakula davidsemakula changed the title internal: improve TokenSet implementation internal: improve TokenSet implementation and add reserved keywords Apr 14, 2024
@davidsemakula (Contributor, Author) commented

Done, also added fix for #16858

@davidsemakula davidsemakula requested a review from lnicola April 15, 2024 14:31
@Veykril (Member) commented Apr 16, 2024

Thanks!
@bors r+

@bors (Contributor) commented Apr 16, 2024

📌 Commit 69fe457 has been approved by Veykril

It is now in the queue for this repository.

@bors (Contributor) commented Apr 16, 2024

⌛ Testing commit 69fe457 with merge e64610d...

@bors (Contributor) commented Apr 16, 2024

☀️ Test successful - checks-actions
Approved by: Veykril
Pushing e64610d to master...

@bors bors merged commit e64610d into rust-lang:master Apr 16, 2024
@davidsemakula davidsemakula deleted the token-set-collisions branch April 16, 2024 08:03
Successfully merging this pull request may close these issues.

Reserved words in paths are not prepended with 'r#'
5 participants