@Marcondiro
Contributor

@Marcondiro Marcondiro commented Oct 31, 2025

Hello,

Bump the Unicode version used by the lexer/parser to 17.0.0 by updating:

  • unicode-normalization to 0.1.25
  • unicode-properties to 0.1.4
  • unicode-width to 0.2.2

and by replacing unicode-xid with unicode-ident, which is also 6 times faster.
I think it might be worth running the benchmarks to double-check.
(unicode-ident is already in src/tools/tidy/src/deps.rs)

Thanks!
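
For illustration, here is a minimal sketch of the kind of identifier check that depends on the XID_Start/XID_Continue data. It is not the code in this PR: the `XidTable` trait, the `AsciiOnly` stand-in, and `ident_len` are hypothetical names, and the stand-in uses plain ASCII rules (not real Unicode tables) so that it compiles without external crates.

```rust
/// The two predicates an identifier lexer needs from its Unicode table.
/// Hypothetical trait for this sketch; a real build would forward these
/// calls to whichever crate supplies the XID data.
trait XidTable {
    fn is_xid_start(&self, c: char) -> bool;
    fn is_xid_continue(&self, c: char) -> bool;
}

/// ASCII-only stand-in. This is NOT the real XID_Start/XID_Continue data.
struct AsciiOnly;

impl XidTable for AsciiOnly {
    fn is_xid_start(&self, c: char) -> bool {
        c.is_ascii_alphabetic()
    }
    fn is_xid_continue(&self, c: char) -> bool {
        c.is_ascii_alphanumeric() || c == '_'
    }
}

/// Returns the byte length of the identifier at the start of `src`,
/// or 0 if `src` does not begin with one. A leading `_` is accepted in
/// addition to XID_Start, since Rust identifiers may start with it.
fn ident_len<T: XidTable>(table: &T, src: &str) -> usize {
    let mut chars = src.char_indices();
    match chars.next() {
        Some((_, c)) if c == '_' || table.is_xid_start(c) => {}
        _ => return 0,
    }
    for (i, c) in chars {
        if !table.is_xid_continue(c) {
            // `i` is the byte offset of the first non-identifier character.
            return i;
        }
    }
    src.len()
}
```

Under these assumptions, `ident_len(&AsciiOnly, "foo_bar1 = 2")` returns 8. Swapping the data source only changes the `XidTable` implementation, not `ident_len`.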

@rustbot
Collaborator

rustbot commented Oct 31, 2025

The list of allowed third-party dependencies may have been modified! You must ensure that any new dependencies have compatible licenses before merging.

cc @davidtwco, @wesleywiser

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

@rustbot rustbot added A-tidy Area: The tidy tool S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 31, 2025
@rustbot rustbot added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 31, 2025
@rustbot
Collaborator

rustbot commented Oct 31, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Replace unicode-xid with unicode-ident which is 6 times faster
@rustbot
Collaborator

rustbot commented Oct 31, 2025

If the Unicode version change is intentional,
the version should also be updated in the reference at
https://github.com/rust-lang/reference/blob/HEAD/src/identifiers.md.

cc @ehuss

@Kobzol
Member

Kobzol commented Oct 31, 2025

@bors try @rust-timer queue

(Parsing could be affected too).

rust-bors bot added a commit that referenced this pull request Oct 31, 2025
parser/lexer: bump to Unicode 17, use faster unicode-ident
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025
@rust-bors

rust-bors bot commented Oct 31, 2025

☀️ Try build successful (CI)
Build commit: 988451c (988451ce73b832a095adca69acf309ce27a2f54d, parent: 23c7bad921fb7163de37ea680bed317deaa03fda)

@rust-timer
Collaborator

Finished benchmarking commit (988451c): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.2% | [0.1%, 0.3%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (primary 2.6%, secondary 3.6%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.6% | [2.6%, 2.6%] | 1 |
| Regressions ❌ (secondary) | 3.6% | [3.6%, 3.6%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.6% | [2.6%, 2.6%] | 1 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 473.971s -> 474.835s (0.18%)
Artifact size: 390.89 MiB -> 390.89 MiB (-0.00%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025
@traviscross traviscross added T-lang Relevant to the language team I-lang-nominated Nominated for discussion during a lang team meeting. I-lang-easy-decision Issue: The decision needed by the team is conjectured to be easy; this does not imply nomination I-lang-radar Items that are on lang's radar and will need eventual work or consideration. P-lang-drag-1 Lang team prioritization drag level 1. https://rust-lang.zulipchat.com/#narrow/channel/410516-t-lang labels Nov 2, 2025
@clarfonthey
Contributor

Is there a reason why the reference explicitly specifies the Unicode version in a way that makes it feel like updating that version is a nontrivial change?

i.e., is there a reason why it does not clarify that the Unicode version in the compiler is allowed to be (and should be) bumped whenever Unicode releases a new version, and to simply say something like "it is version N as of Rust 1.M"?

@crlf0710 crlf0710 added the A-Unicode Area: Unicode label Nov 4, 2025
@traviscross traviscross removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Nov 5, 2025
@joshtriplett
Member

joshtriplett commented Nov 5, 2025

I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to XID_Start/XID_Continue and any related changes to confusables that overlap with XID_Start/XID_Continue.

I don't think that review is best done in lang. I think we should delegate that to whichever team is making sure of the above. (Is that T-compiler?) So, ideally, I'd love to see a proposal to lang requesting a delegation to take responsibility for the above.

That said, let's go ahead and sign off on this change to unblock it.

@traviscross
Contributor

This fits the model of "reverting this might be a breaking change", which is our standard for lang FCP barring a specific delegation, and this is part of the language definition, so let's do the proper thing and propose FCP here.

If someone wants to write up a request for a specific delegation, as @joshtriplett mentioned above, for us to approve, we'd be interested in reviewing that.

@rfcbot fcp merge

@rust-rfcbot
Collaborator

rust-rfcbot commented Nov 5, 2025

Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

@rust-rfcbot rust-rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Nov 5, 2025
@traviscross traviscross removed I-lang-nominated Nominated for discussion during a lang team meeting. I-lang-easy-decision Issue: The decision needed by the team is conjectured to be easy; this does not imply nomination P-lang-drag-1 Lang team prioritization drag level 1. https://rust-lang.zulipchat.com/#narrow/channel/410516-t-lang labels Nov 5, 2025
@nikomatsakis
Contributor

@rfcbot reviewed

I don't particularly think this needs a lang FCP per se, but I agree that it needs somebody to FCP.

@rust-rfcbot rust-rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Nov 5, 2025
@rust-rfcbot
Collaborator

🔔 This is now entering its final comment period, as per the review above. 🔔

@ehuss
Contributor

ehuss commented Nov 5, 2025

I wanted to mention a minor concern that this may make it more difficult to keep the version in sync between normalization and xid. Previously these crates were maintained by the same org, and were generally bumped around the same time. With unicode-ident, it may get its version pushed at different times (and I seem to recall it updating more quickly).

That could be managed by delaying updates, but could be a little awkward since that requires manual intervention. (Fortunately Unicode doesn't update often.)

@clarfonthey
Contributor

clarfonthey commented Nov 5, 2025

> I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to XID_Start/XID_Continue and any related changes to confusables that overlap with XID_Start/XID_Continue.

This seems to demonstrate a misunderstanding of Unicode's stability policy. Sure, I agree that Unicode versions should not be bumped non-uniformly due to the potential overlap between confusables and identifier characters, but since a lot of trust has been forwarded toward the consortium in choosing the best options on identifiers and confusables, there shouldn't be any concern bumping the Unicode version. XID is already a very stable standard with explicit guarantees to not remove characters between versions.

> This fits the model of "reverting this might be a breaking change", which is our standard for lang FCP barring a specific delegation, and this is part of the language definition, so let's do the proper thing and propose FCP here.

While I think this is a reasonable and conservative approach, I think that there is less gained by having an FCP over setting up an explicit playbook for these kinds of updates that includes all necessary changes (like making sure everything is using the same Unicode version), and said playbook could be subject to an RFC and/or FCP.

Basically, my point is that this change is one we should make sure is being done correctly, but I don't think that having every lang team member sign off is necessarily the right way to ensure that.

Note, the switch for what crate is managing the Unicode identifier data is something that I think deserves an FCP, or at least careful scrutiny. But bumping Unicode versions isn't IMHO.
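
To make the stability guarantee mentioned above concrete: "no removals between versions" means that every character accepted by the old identifier predicate is still accepted by the new one. A rough sketch of such a check follows. `removed_chars` is a hypothetical helper, and the two predicates passed to it would have to come from the old and new Unicode tables; how those are obtained is left as an assumption.

```rust
/// Returns every character accepted by `old` but rejected by `new`.
/// An empty result means `new` only added characters relative to `old`,
/// which is what a "no removals" guarantee would imply.
/// Hypothetical helper: the predicates are supplied by the caller.
fn removed_chars(old: impl Fn(char) -> bool, new: impl Fn(char) -> bool) -> Vec<char> {
    // Walks every Unicode scalar value once (surrogates are skipped
    // automatically because they are not valid `char`s).
    ('\0'..=char::MAX).filter(|&c| old(c) && !new(c)).collect()
}
```

With stand-in predicates such as `|c: char| c.is_ascii_lowercase()` for the old table and `|c: char| c.is_ascii_alphabetic()` for the new one, the result is empty, since the new set is a superset of the old one.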

@scottmcm
Member

scottmcm commented Nov 5, 2025

@ehuss I would emphasize that the lang FCP here is just about bumping the Unicode version in the abstract. If compiler has concerns about the crates used -- especially if the "6 times faster" crate is showing a perf regression -- then compiler should absolutely feel empowered to reject the implementation and ask for something else.

(I personally haven't investigated the implementation enough to have an opinion at this time.)

@Marcondiro
Contributor Author

@ehuss the Unicode version misalignment between crates is a valid concern, but I don't think that relying on unicode-rs' crates being bumped at the same time is a solution. For instance, unicode-width got updated to Unicode 16 a few months after unicode-xid got its update. Even right now, the latest unicode-xid on crates.io is still based on Unicode 16 while normalization is on 17.
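
One way such misalignment could be caught mechanically is a check that compares the Unicode version reported by each data source against the expected one. The sketch below is only an assumption about how that might look: `UnicodeVersion` and `mismatched` are hypothetical names, and how each crate's version tuple is obtained is deliberately left out.

```rust
/// A Unicode version as (major, minor, patch), e.g. (17, 0, 0).
type UnicodeVersion = (u8, u8, u8);

/// Returns the names of all sources whose version differs from `expected`.
/// Hypothetical helper: the caller supplies (name, version) pairs.
fn mismatched<'a>(
    expected: UnicodeVersion,
    sources: &[(&'a str, UnicodeVersion)],
) -> Vec<&'a str> {
    sources
        .iter()
        .filter(|(_, v)| *v != expected)
        .map(|(name, _)| *name)
        .collect()
}
```

For example, with illustrative values matching the situation described above (normalization on 17, unicode-xid on 16), calling `mismatched((17, 0, 0), &[("unicode-normalization", (17, 0, 0)), ("unicode-xid", (16, 0, 0))])` reports only `"unicode-xid"`.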

Labels

A-tidy Area: The tidy tool A-Unicode Area: Unicode disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. I-lang-radar Items that are on lang's radar and will need eventual work or consideration. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-lang Relevant to the language team
