parser/lexer: bump to Unicode 17, use faster unicode-ident #148321

Marcondiro · 2025-10-31T09:23:37Z

Hello,

Bump the Unicode version used by the lexer/parser to 17.0.0 by updating:

unicode-normalization to 0.1.25
unicode-properties to 0.1.4
unicode-width to 0.2.2

and by replacing unicode-xid with unicode-ident, which is also 6 times faster.
I think it might be worth running the benchmarks to double-check.
(unicode-ident is already in src/tools/tidy/src/deps.rs)

Thanks!
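
For readers unfamiliar with where these crates sit, here is a minimal sketch of a lexer-style identifier check, assuming the rule is "`_` or XID_Start, followed by XID_Continue". The `xid_start` and `xid_continue` functions below are hypothetical std-only stand-ins so the block compiles without dependencies; they are not Unicode-accurate. With the crate swapped in, they would presumably be `unicode_ident::is_xid_start` and `unicode_ident::is_xid_continue`.

```rust
// Stand-in for `unicode_ident::is_xid_start` (assumption: approximated
// with std so the sketch has no dependencies; NOT Unicode-accurate).
fn xid_start(c: char) -> bool {
    c.is_alphabetic()
}

// Stand-in for `unicode_ident::is_xid_continue` (same caveat as above).
fn xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// True if `c` may start an identifier: `_` or XID_Start.
pub fn is_id_start(c: char) -> bool {
    c == '_' || xid_start(c)
}

/// True if `c` may continue an identifier: XID_Continue.
pub fn is_id_continue(c: char) -> bool {
    xid_continue(c)
}

/// True if the whole string has the shape of an identifier.
pub fn is_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => is_id_start(first) && chars.all(is_id_continue),
        None => false, // the empty string is never an identifier
    }
}
```

Because only the two predicates touch Unicode data, replacing the crate behind them should leave the rest of the check unchanged.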

rustbot · 2025-10-31T09:23:41Z

The list of allowed third-party dependencies may have been modified! You must ensure that any new dependencies have compatible licenses before merging.

cc @davidtwco, @wesleywiser

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

rustbot · 2025-10-31T09:23:43Z

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-10-31T10:20:42Z

If the Unicode version changes are intentional,
it should also be updated in the reference at
https://github.com/rust-lang/reference/blob/HEAD/src/identifiers.md.

cc @ehuss

Kobzol · 2025-10-31T13:25:13Z

@bors try @rust-timer queue

(Parsing could be affected too).

rust-bors · 2025-10-31T15:40:35Z

☀️ Try build successful (CI)
Build commit: 988451c (988451ce73b832a095adca69acf309ce27a2f54d, parent: 23c7bad921fb7163de37ea680bed317deaa03fda)

rust-timer · 2025-10-31T17:22:53Z

Finished benchmarking commit (988451c): comparison URL.

Overall result: ❌ regressions - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.2% | [0.1%, 0.3%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (primary 2.6%, secondary 3.6%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.6% | [2.6%, 2.6%] | 1 |
| Regressions ❌ (secondary) | 3.6% | [3.6%, 3.6%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.6% | [2.6%, 2.6%] | 1 |

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 473.971s -> 474.835s (0.18%)
Artifact size: 390.89 MiB -> 390.89 MiB (-0.00%)

clarfonthey · 2025-11-04T05:20:59Z

Is there a reason why the reference explicitly specifies the Unicode version in a way that makes it feel like updating that version is a nontrivial change?

i.e., is there a reason why it does not clarify that the Unicode version in the compiler is allowed to be (and should be) bumped whenever Unicode releases a new version, and to simply say something like "it is version N as of Rust 1.M"?

joshtriplett · 2025-11-05T16:16:40Z

I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to XID_Start/XID_Continue and any related changes to confusables that overlap with XID_Start/XID_Continue.
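
One way to make that cursory look concrete is to diff the property between the old and new data. The following is a sketch only, assuming each version's XID_Start (or XID_Continue) set is available as a predicate on `char`; `PropertyDiff` and `diff_property` are hypothetical names, and no real Unicode tables are included.

```rust
/// Differences in a character property between two Unicode versions.
#[derive(Debug, Default, PartialEq)]
pub struct PropertyDiff {
    /// Characters that have the property only in the new version.
    pub added: Vec<char>,
    /// Characters that have the property only in the old version.
    /// Unicode's identifier stability policy implies this stays
    /// empty for XID_Start and XID_Continue.
    pub removed: Vec<char>,
}

/// Walk every Unicode scalar value and compare the two predicates.
pub fn diff_property(
    old: impl Fn(char) -> bool,
    new: impl Fn(char) -> bool,
) -> PropertyDiff {
    let mut diff = PropertyDiff::default();
    // Iterating a `char` range skips the surrogate gap automatically.
    for c in '\0'..=char::MAX {
        match (old(c), new(c)) {
            (false, true) => diff.added.push(c),
            (true, false) => diff.removed.push(c),
            _ => {}
        }
    }
    diff
}
```

In practice the two predicates would come from the old and new crate releases. The `added` list is what a reviewer would skim against the confusables data, and a non-empty `removed` list would be the "out of the ordinary" case.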

I don't think that review is best done in lang. I think we should delegate that to whichever team is making sure of the above. (Is that T-compiler?) So, ideally, I'd love to see a proposal to lang requesting a delegation to take responsibility for the above.

That said, let's go ahead and sign off on this change to unblock it.

traviscross · 2025-11-05T16:19:49Z

This fits the model of "reverting this might be a breaking change", which is our standard for lang FCP barring a specific delegation, and this is part of the language definition, so let's do the proper thing and propose FCP here.

If someone wants to write up a request for a specific delegation, as @joshtriplett mentioned above, for us to approve, we'd be interested in reviewing that.

@rfcbot fcp merge

rust-rfcbot · 2025-11-05T16:19:51Z

Team member @traviscross has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

cc @rust-lang/lang-advisors: FCP proposed for lang, please feel free to register concerns.
See this document for info about what commands tagged team members can give me.

nikomatsakis · 2025-11-05T16:21:47Z

@rfcbot reviewed

I don't particularly think this needs a lang FCP per se, but I agree that it needs somebody to FCP.

rust-rfcbot · 2025-11-05T16:21:55Z

🔔 This is now entering its final comment period, as per the review above. 🔔

ehuss · 2025-11-05T16:44:45Z

I wanted to mention a minor concern that this may make it more difficult to keep the version in sync between normalization and xid. Previously these crates were maintained by the same org, and were generally bumped around the same time. With unicode-ident, it may get its version pushed at different times (and I seem to recall it updating more quickly).

That could be managed by delaying updates, but could be a little awkward since that requires manual intervention. (Fortunately Unicode doesn't update often.)
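
A check along these lines could turn that manual step into a test failure. This is a sketch under stated assumptions: the crate names are the ones discussed in this PR, but `check_aligned` is a hypothetical helper, the version tuples would be supplied by the caller, and whether every crate exposes its Unicode version as a constant of this shape would need to be verified.

```rust
/// A Unicode version as (major, minor, patch).
pub type UnicodeVersion = (u8, u8, u8);

/// Compare every crate's Unicode version against the first entry.
/// Returns `Err` with the names of the crates that disagree.
pub fn check_aligned<'a>(
    crates: &[(&'a str, UnicodeVersion)],
) -> Result<(), Vec<&'a str>> {
    let Some(&(_, expected)) = crates.first() else {
        return Ok(()); // nothing to compare
    };
    let mismatched: Vec<&str> = crates
        .iter()
        .filter(|(_, version)| *version != expected)
        .map(|(name, _)| *name)
        .collect();
    if mismatched.is_empty() {
        Ok(())
    } else {
        Err(mismatched)
    }
}
```

Called with one `(name, version)` pair per dependency, a mismatch names the crates to hold back (or bump) before the update lands.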

clarfonthey · 2025-11-05T17:32:51Z

> I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to XID_Start/XID_Continue and any related changes to confusables that overlap with XID_Start/XID_Continue.

This seems to demonstrate a misunderstanding of Unicode's stability policy. Sure, I agree that Unicode versions should not be bumped non-uniformly, due to the potential overlap between confusables and identifier characters. But since we already place a lot of trust in the consortium to choose the best options on identifiers and confusables, there shouldn't be any concern about bumping the Unicode version. XID is already a very stable standard with explicit guarantees not to remove characters between versions.

> This fits the model of "reverting this might be a breaking change", which is our standard for lang FCP barring a specific delegation, and this is part of the language definition, so let's do the proper thing and propose FCP here.

While I think this is a reasonable and conservative approach, I think less is gained from an FCP than from setting up an explicit playbook for these kinds of updates that includes all necessary changes (like making sure everything is using the same Unicode version). That playbook could itself be subject to an RFC and/or FCP.

Basically, my point is that this is a change we should make sure is being done correctly, but I don't think that having every lang team member sign off is necessarily the right way to ensure that.

Note that switching which crate manages the Unicode identifier data is something that I think deserves an FCP, or at least careful scrutiny. Bumping Unicode versions doesn't, IMHO.

scottmcm · 2025-11-05T17:37:30Z

@ehuss I would emphasize that the lang FCP here is just about bumping the unicode version in the abstract. If compiler has concerns about the crates used -- especially if the "6 times faster" crate is showing a perf regression -- then compiler should absolutely feel empowered to reject the implementation and ask for something else.

(I personally haven't investigated the implementation enough to have an opinion at this time.)

Marcondiro · 2025-11-06T08:37:33Z

@ehuss the Unicode version misalignment between crates is a valid concern, but I don't think that relying on unicode-rs' crates being bumped at the same time is a solution. For instance, unicode-width was updated to Unicode 16 a few months after unicode-xid got its update. Even right now, the latest unicode-xid on crates.io is still based on Unicode 16 while normalization is on 17.

rustbot added A-tidy Area: The tidy tool S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 31, 2025

rustbot assigned Mark-Simulacrum Oct 31, 2025

rustbot added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 31, 2025

parser/lexer: bump to Unicode 17, use faster unicode-ident

83d95c8

Replace unicode-xid with unicode-ident which is 6 times faster

Marcondiro force-pushed the master branch from a1d821c to 83d95c8 Compare October 31, 2025 10:20

Marcondiro mentioned this pull request Oct 31, 2025

identifiers: bump Unicode from 16 to 17 rust-lang/reference#2071

Open

rust-bors bot added a commit that referenced this pull request Oct 31, 2025

Auto merge of #148321 - Marcondiro:master, r=<try>

988451c

parser/lexer: bump to Unicode 17, use faster unicode-ident

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Oct 31, 2025

crlf0710 added the A-Unicode Area: Unicode label Nov 4, 2025

traviscross removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Nov 5, 2025

rust-rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. labels Nov 5, 2025

rust-rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Nov 5, 2025
