internal: improve TokenSet implementation and add reserved keywords #17037
Conversation
How come this never fails in practice? Do we never construct a …
I have a private language project (based on …)
I mean, we're cutting it pretty close with …
We should certainly have a …
But can we fit them all in 0 to 63 though? 😅 (No, I don't expect it to make any difference)
I mean, we are using a u128 right now; that suffices, no?
Yeah, but if we could fit them in 64 bits, we'd save 8 full bytes of memory for a token set!
Gotcha, that explains why it hasn't been problematic before 🙂
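The 8-byte figure in the exchange above is just the difference between the two backing widths; a trivial sketch to confirm it:

```rust
use std::mem::size_of;

fn main() {
    // A u128-backed bit-set occupies 16 bytes; a u64-backed one would
    // occupy 8, saving 8 bytes per token set, as noted above.
    let savings = size_of::<u128>() - size_of::<u64>();
    assert_eq!(savings, 8);
    println!("savings: {savings} bytes");
}
```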
As mentioned earlier, I ran into the issue while solving #16858. I attempted to add about 7 reserved keywords (e.g. …). I could solve the issue another way, but I thought adding the reserved keywords here was the best solution.
I see, adding the keywords as kinds seems like the proper approach here. Can we change the token set to just contain a …
I've updated the implementation to use … I'll be pushing the fix for #16858 (i.e. adding missing reserved keywords and related auto-import tests) in a separate PR after this one is merged.
Codegenerating this seems a bit overkill; let's just keep the manual impl. I expect us to only ever need to bump the backing limit once (when you add the new contextual keywords).
Done, also added fix for #16858
Thanks!
☀️ Test successful - checks-actions
The current `TokenSet` type represents "a bit-set of `SyntaxKind`s" as a newtype over `u128`.

Internally, the flag for each `SyntaxKind` variant in the bit-set is set as the n-th LSB (least significant bit) via a bit-wise left shift, where n is the variant's discriminant.

Edit: This is problematic because there are currently ~121 token `SyntaxKind`s, so adding new token kinds for the missing reserved keywords pushes the number of token `SyntaxKind`s above 128, making this "mask" operation overflow.

~~This is problematic because there are currently 266 `SyntaxKind`s, so this "mask" operation silently overflows in release mode.~~ This leads to a single flag/bit in the bit-set being shared by multiple `SyntaxKind`s.
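The overflow described above can be reproduced in isolation. A minimal hypothetical sketch (not rust-analyzer's actual code) of the old `u128` mask:

```rust
// Hypothetical sketch of the old u128-backed mask: the n-th bit is set by
// left-shifting 1 by the kind's discriminant. A shift by 128 or more
// overflows: it panics in debug builds, while with overflow checks off the
// shift amount is effectively taken mod 128, so two kinds whose
// discriminants differ by 128 silently share one bit.
fn mask(discriminant: u16) -> u128 {
    // wrapping_shl makes the release-mode wrapping behaviour explicit
    1u128.wrapping_shl(discriminant as u32)
}

fn main() {
    // A kind with discriminant 0 and one with discriminant 128 collide.
    assert_eq!(mask(0), mask(128));
    println!("mask(0) == mask(128): {}", mask(0) == mask(128));
}
```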
This PR:
- Changes `TokenSet` from `u128` to ~~`[u64; 3]`~~ `[u*; N]` (currently `[u16; 17]`), where `u*` can be any desirable unsigned integer type and `N` is the minimum array length needed to represent all token `SyntaxKind`s without any collisions
- Ensures that `TokenSet`s only include token `SyntaxKind`s
- Moves the definition of the `TokenSet` type to grammar codegen in xtask, so that `N` is adjusted automatically (depending on the chosen `u*` "base" type) when new `SyntaxKind`s are added
- Updates the `token_set_works_for_tokens` unit test to include the `__LAST` `SyntaxKind` as a way of catching overflows in tests

Currently `u16` is arbitrarily chosen as the `u*` "base" type, mostly because it strikes a good balance (IMO) between unused bits and readability of the generated `TokenSet` code (especially the `union` method), but I'm open to other suggestions or a better methodology for choosing the `u*` type.

I considered using a third-party crate for the bit-set, but a direct implementation seems simple enough without adding any new dependencies. I'm not strongly opposed to using a third-party crate though, if that's preferred.

Finally, I haven't had the chance to review issues to figure out whether there are any parser issues caused by collisions due to the current implementation that may be fixed by this PR - I just stumbled upon the issue while adding "new" keywords to solve #16858.

Edit: fixes #16858