parser/lexer: bump to Unicode 17, use faster unicode-ident #148321
|
The list of allowed third-party dependencies may have been modified! You must ensure that any new dependencies have compatible licenses before merging. These commits modify the If this was unintentional then you should revert the changes before this PR is merged. |
|
rustbot has assigned @Mark-Simulacrum. Use |
Replace unicode-xid with unicode-ident which is 6 times faster
|
If the Unicode version changes are intentional, cc @ehuss |
|
@bors try @rust-timer queue (Parsing could be affected too). |
parser/lexer: bump to Unicode 17, use faster unicode-ident
|
Finished benchmarking commit (988451c): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never

Instruction count
Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

Max RSS (memory usage)
This benchmark run did not return any relevant results for this metric.

Cycles
Results (primary 2.6%, secondary 3.6%)
A less reliable metric. May be of interest, but not used to determine the overall result above.

Binary size
This benchmark run did not return any relevant results for this metric.

Bootstrap: 473.971s -> 474.835s (0.18%) |
|
Is there a reason why the reference explicitly specifies the Unicode version in a way that makes it feel like updating that version is a nontrivial change? i.e., is there a reason why it does not clarify that the Unicode version in the compiler is allowed to be (and should be) bumped whenever Unicode releases a new version, and to simply say something like "it is version N as of Rust 1.M"? |
|
I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to I don't think that review is best done in lang. I think we should delegate that to whichever team is making sure of the above. (Is that T-compiler?) So, ideally, I'd love to see a proposal to lang requesting a delegation to take responsibility for the above. That said, let's go ahead and sign off on this change to unblock it. |
|
This fits the model of "reverting this might be a breaking change", which is our standard for lang FCP barring a specific delegation, and this is part of the language definition, so let's do the proper thing and propose FCP here. If someone wants to write up a request for a specific delegation, as @joshtriplett mentioned above, for us to approve, we'd be interested in reviewing that. @rfcbot fcp merge |
|
Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns. |
|
@rfcbot reviewed I don't particularly think this needs a lang FCP per se, but I agree that it needs somebody to FCP. |
|
🔔 This is now entering its final comment period, as per the review above. 🔔 |
|
I wanted to mention a minor concern that this may make it more difficult to keep the version in sync between normalization and xid. Previously these crates were maintained by the same org, and were generally bumped around the same time. With unicode-ident, it may get its version pushed at different times (and I seem to recall it updating more quickly). That could be managed by delaying updates, but could be a little awkward since that requires manual intervention. (Fortunately Unicode doesn't update often.) |
This seems to demonstrate a misunderstanding of Unicode's stability policy. Sure, I agree that Unicode versions should not be bumped non-uniformly due to the potential overlap between confusables and identifier characters, but since a lot of trust has been forwarded toward the consortium in choosing the best options on identifiers and confusables, there shouldn't be any concern bumping the Unicode version. XID is already a very stable standard with explicit guarantees to not remove characters between versions.
While I think this is a reasonable and conservative approach, I think that there is less gained by having an FCP over setting up an explicit playbook for these kinds of updates that includes all necessary changes (like making sure everything is using the same Unicode version), and said playbook could be subject to an RFC and/or FCP. Basically, my point is that this change is one we should make sure is being done correctly, but I don't think that having every lang team member sign off is the right way to ensure that necessarily. Note, the switch for what crate is managing the Unicode identifier data is something that I think deserves an FCP, or at least careful scrutiny. But bumping Unicode versions isn't IMHO. |
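One step of the playbook suggested above, "making sure everything is using the same Unicode version", could be automated with a small consistency check. The following is only a sketch under stated assumptions: the function name `check_unicode_versions` is hypothetical, and the caller is assumed to collect each crate's reported version by hand (not every crate in the list is known to export a version constant).

```rust
/// A Unicode version as (major, minor, patch).
type UnicodeVersion = (u8, u8, u8);

/// Hypothetical helper: checks that every listed crate reports the same
/// Unicode version as the first entry. Returns the shared version on
/// success, or one message per mismatching crate on failure.
fn check_unicode_versions(
    reported: &[(&str, UnicodeVersion)],
) -> Result<UnicodeVersion, Vec<String>> {
    // An empty list cannot be validated, so treat it as an error.
    let Some(&(_, expected)) = reported.first() else {
        return Err(vec!["no crates listed".to_string()]);
    };

    // Collect a readable message for each crate that disagrees.
    let mismatches: Vec<String> = reported
        .iter()
        .filter(|(_, v)| *v != expected)
        .map(|(name, v)| {
            format!(
                "{name} reports {}.{}.{}, expected {}.{}.{}",
                v.0, v.1, v.2, expected.0, expected.1, expected.2
            )
        })
        .collect();

    if mismatches.is_empty() {
        Ok(expected)
    } else {
        Err(mismatches)
    }
}
```

A check like this would flag the situation described earlier in the thread, where one crate moves to a new Unicode release before the others, without requiring anyone to remember to compare versions manually.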
|
@ehuss I would emphasize that the lang FCP here is just about bumping the unicode version in the abstract. If compiler has concerns about the crates used -- especially if the "6 times faster" crate is showing a perf regression -- then compiler should absolutely feel empowered to reject the implementation and ask for something else. (I personally haven't investigated the implementation enough to have an opinion at this time.) |
|
@ehuss the unicode version misalignment between crates is a valid concern, but I don't think that relying on unicode-rs' crates being bumped at the same time is a solution. For instance, |
Hello,
Bump the unicode version used by lexer/parser to 17.0.0 by updating:
- unicode-normalization to 0.1.25
- unicode-properties to 0.1.4
- unicode-width to 0.2.2

and by replacing unicode-xid with unicode-ident, which is also 6 times faster.

I think it might be worth running the benchmarks to double check.

(unicode-ident is already in src/tools/tidy/src/deps.rs)

Thanks!
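For readers unfamiliar with what the swapped crate is used for, here is a minimal sketch of an identifier check built on XID-style predicates. It is not the compiler's actual lexer code. The function `is_ident_with` and the ASCII stand-in predicates are hypothetical and exist only so the example compiles without dependencies; in real code one would pass `unicode_ident::is_xid_start` and `unicode_ident::is_xid_continue`, both of which take a `char` and return `bool`.

```rust
/// Hypothetical helper: returns true if `s` is non-empty, its first
/// character is `_` or satisfies `is_start`, and every remaining
/// character satisfies `is_continue`.
fn is_ident_with(
    s: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // The first character decides whether an identifier can begin here.
        Some(c) if c == '_' || is_start(c) => chars.all(is_continue),
        // Empty input or an invalid first character.
        _ => false,
    }
}

/// ASCII-only stand-in for an XID_Start predicate (assumption for this sketch).
fn ascii_start(c: char) -> bool {
    c.is_ascii_alphabetic()
}

/// ASCII-only stand-in for an XID_Continue predicate (assumption for this sketch).
fn ascii_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}
```

Because the predicates are passed in as parameters, swapping one Unicode data crate for another changes only the two function arguments, which is roughly why a replacement like the one in this PR can stay small at the call sites.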