std.fmt: Clarify that width is measured in Unicode Codepoints. #18536
Closed
So, a change just went in to allow any Unicode codepoint to be used for the fill "character" (279607c). Is that wrong too?
That one is a bit tricky. It's a counter-intuitive UI, but it's technically OK since the implementation does not need to be Unicode-aware to use an arbitrary sequence of bytes as a fill character. It may as well be `fill_bytes: []const u8`, with the implementation assuming that all those bytes are to be treated as one width unit. However, it's not worth having that field be a reference to external memory, so having it be a fixed-size integer is worth the limitation. It's a similar rationale to Zig's character literals, which are `comptime_int` and support any single Unicode codepoint, but do not, for example, support 👨‍👩‍👧‍👦, which is 4 codepoints joined with 3 Zero Width Joiner codepoints, because the purpose of a character literal is to be an integer.

This kind of unfortunate complexity (the fact that there is not a single integer corresponding to every Unicode character) is one reason I have no intention for Zig to depend on the large amount of volatile data needed to keep up with Unicode.
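To make the counts above concrete, here is a minimal sketch in Python (not Zig, and not how `std.fmt` is implemented). The helper names `codepoints` and `fits_in_one_integer` are hypothetical and exist only for this illustration. It shows that the family emoji is seven codepoints, so no single integer can stand for it, while a lone codepoint such as é can.

```python
# Illustration only: Python's str is a sequence of Unicode codepoints,
# so len() and ord() operate on codepoints, not bytes or graphemes.

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man, woman, girl, boy, joined by three ZWJ codepoints.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])


def codepoints(text: str) -> list[int]:
    """Return the codepoints of text as plain integers."""
    return [ord(ch) for ch in text]


def fits_in_one_integer(text: str) -> bool:
    """True when text is exactly one codepoint (hypothetical helper)."""
    return len(text) == 1


print(len(codepoints(family)))           # 7 codepoints: 4 emoji + 3 joiners
print(len(family.encode("utf-8")))       # 25 bytes: 4 * 4 + 3 * 3
print(fits_in_one_integer(family))       # False
print(fits_in_one_integer("\u00e9"))     # True: é is the single codepoint U+00E9
```

Whether those seven codepoints render as one glyph depends on the font and renderer, which is information a codepoint-level API does not have.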
This points up how the term "character" is too ambiguous; Unicode itself doesn't define it, for good reason. The example of 👨‍👩‍👧‍👦 is what's more technically termed a grapheme cluster (that it looks like a single "character" here is entirely dependent on the font and the display context, i.e. the web browser).

The Zig term "character literal" trips some people up, because it's actually a "Unicode code point" literal. It would be nice to transition the discussion and the docs to use this term, even if it departs from the C terminology. Lots of folks wish for a "character cell" model for text formatting, but this always falls apart in the face of combining characters, worldwide text, fonts, and rendering technology. These are well beyond the scope of the standard library.

What's most often of concern when writing format-to-buffer is the storage for the data, so stick to bytes for sizes, and to return values that give you the resulting sizes of things. The fill quantity perhaps should not be bytes or characters, but a count of repetitions of the fill codepoint. Even if you have a Unicode character database, that is not sufficient in general for text layout. Counts of Unicode codepoints are in general not useful, and tend to encourage the wrong mental model of worldwide (Unicode) text.
Andrew, I think, has drawn just the right lines of compromise for fmt functionality.
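A small sketch of why codepoint counts can mislead, again in Python rather than Zig. `pad_left` is a hypothetical helper, not the `std.fmt` algorithm: it assumes width is measured in codepoints and the fill is a single codepoint repeated some number of times. The two inputs render identically as é, yet receive different amounts of padding, and the byte size of the result is a separate quantity from the requested width.

```python
def pad_left(text: str, width: int, fill: str = " ") -> str:
    """Right-align text by repeating a one-codepoint fill.

    Assumption for this sketch: width counts codepoints, not bytes,
    grapheme clusters, or display columns.
    """
    if len(fill) != 1:
        raise ValueError("fill must be exactly one codepoint")
    repetitions = max(0, width - len(text))
    return fill * repetitions + text


precomposed = "\u00e9"   # é as one codepoint
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT, same appearance

a = pad_left(precomposed, 5, "*")
b = pad_left(decomposed, 5, "*")

print(a.count("*"), b.count("*"))                        # 4 3
print(len(a), len(b))                                    # 5 5 (codepoints)
print(len(a.encode("utf-8")), len(b.encode("utf-8")))    # 6 6 (bytes)
```

Both results are five codepoints long, but the second shows one fewer `*` on screen, which is why the comment above suggests describing fill as a repetition count and reporting sizes in bytes.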