
ahoppen
Member

@ahoppen ahoppen commented Feb 27, 2024

Fixes a crash and improves the printing of errors on lines that contain multi-byte UTF-8 characters.

Fixes #2507
rdar://123409748

@ahoppen ahoppen requested a review from bnbarham as a code owner February 27, 2024 00:16
@ahoppen ahoppen force-pushed the ahoppen/character-column branch from 15f11e8 to 7676255 Compare February 27, 2024 00:18
@ahoppen
Member Author

ahoppen commented Feb 27, 2024

@swift-ci Please test

/// For example the 👨‍👩‍👧‍👦 character is considered as a single character, not 25 bytes.
///
/// Both the input and the output column are 1-based.
func characterColumn(ofUtf8Column utf8Column: Int) -> Int {

Contributor

I assume you don't think this should be part of the location converter?

Member Author

No, I think it’s quite specific to this use case.
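
For readers following the thread, the byte-column to character-column conversion discussed above could look roughly like the sketch below. This is not the code from the PR: the extra `line` parameter and the clamping behaviour are assumptions made so the example is self-contained.

```swift
/// Hypothetical sketch: converts a 1-based UTF-8 byte column within `line`
/// to a 1-based column counted in grapheme clusters (Swift `Character`s).
/// The `line` parameter is an assumption made for this example.
func characterColumn(ofUtf8Column utf8Column: Int, in line: String) -> Int {
    let utf8 = line.utf8
    // Clamp the 0-based byte offset so out-of-range columns cannot trap.
    let byteOffset = max(0, min(utf8Column - 1, utf8.count))
    let target = utf8.index(utf8.startIndex, offsetBy: byteOffset)

    // Count the characters that end at or before the byte offset.
    var column = 1
    var index = line.startIndex
    while index < line.endIndex {
        let next = line.index(after: index)
        if next > target { break }
        column += 1
        index = next
    }
    return column
}

// "let <family emoji> = ;" with the emoji spelled out as a ZWJ sequence.
let line = "let \u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466} = ;"
// The semicolon starts at UTF-8 column 33 (4 + 25 + 3 bytes precede it).
let semicolonColumn = characterColumn(ofUtf8Column: 33, in: line)
print(semicolonColumn)
```

A byte column that falls in the middle of a multi-byte character maps to the column of the character that contains it, which avoids slicing the string at an invalid offset.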

@ahoppen
Member Author

ahoppen commented Feb 27, 2024

@swift-ci Please test Windows

Contributor

@Matejkob Matejkob left a comment

I'm not so sure if considering emojis as one character is correct. Are they two-character symbols?

I believe in this test:

func testEmojiInSourceCode() {
    let source = """
      let 👨‍👩‍👧‍👦 = ;
      """

    let expectedOutput = """
      1 │ let 👨‍👩‍👧‍👦 = ;
        │         ╰─ error: expected expression in variable
      """

    assertStringsEqualWithDiff(annotate(source: source), expectedOutput)
}

the diagnostic should point to ; and not to the space between = and ;. I cannot test it right now, but I think if we change the variable name to regular characters like so:

func testEmojiInSourceCode() {
    let source = """
      let e = ;
      """

    let expectedOutput = """
      1 │ let e = ;
        │         ╰─ error: expected expression in variable
      """

    assertStringsEqualWithDiff(annotate(source: source), expectedOutput)
}

the annotation will point to the semicolon.

@ahoppen
Member Author

ahoppen commented Feb 27, 2024

The problem is that, as far as I know, it depends on the font that renders the emoji. A font can render an emoji as a single-width character, a double-width character, anything in between, or however it likes, really. As far as I know, there’s no Unicode specification for it, and my thought is that assuming that 👨‍👩‍👧‍👦 is rendered as a single character is closer to the truth than assuming that it gets rendered as 25 characters.
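
To make the "one character versus 25 bytes" comparison concrete, here is a small illustrative snippet (not from the PR) that counts the same emoji in each of Swift's string views. The emoji is written as explicit scalars so the ZWJ sequence is unambiguous. The grapheme count assumes a reasonably recent Swift toolchain.

```swift
// Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let characterCount = family.count               // grapheme clusters
let scalarCount = family.unicodeScalars.count   // 4 emoji + 3 joiners
let utf16Count = family.utf16.count             // 4 * 2 + 3 * 1 code units
let utf8Count = family.utf8.count               // 4 * 4 + 3 * 3 bytes

print(characterCount, scalarCount, utf16Count, utf8Count) // 1 7 11 25
```

None of these counts says how many terminal columns the glyph occupies. They only bound the guess between 1 and 25.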

@Matejkob
Contributor

I ran the emoji test, and indeed, the annotation shifts with a different Xcode font (and also in the terminal), which highlights the complexity of emoji rendering in code.
Screenshot 2024-02-27 at 9 09 49 PM
Your explanation about the unpredictable rendering of emojis due to font differences is super insightful, thanks!

Out of curiosity, I've built this C code:

#include <stdio.h>

int main() {
    printf("Hello, World!🤯")
    return 0;
}

and the annotation in the diagnostic message produced by gcc is shifted as well:
Screenshot 2024-02-27 at 9 11 48 PM

@ahoppen ahoppen merged commit d36f0c1 into swiftlang:main Feb 28, 2024
@ahoppen ahoppen deleted the ahoppen/character-column branch February 28, 2024 03:21
@al45tair
Contributor

al45tair commented Apr 8, 2024

The problem is that, as far as I know, it depends on the font that renders the emoji. A font can render an emoji as a single-width character, a double-width character, anything in between, or however it likes, really. As far as I know, there’s no Unicode specification for it, and my thought is that assuming that 👨‍👩‍👧‍👦 is rendered as a single character is closer to the truth than assuming that it gets rendered as 25 characters.

This is mainly to do with terminal emulator behaviour rather than the font (though the font can certainly make things worse if it so desires). Generally speaking, terminal emulators try to match the behaviour of the wcwidth() function on their respective platform, and if that supports Unicode, it tends to just use the narrow or wide property from the East Asian Width table without doing anything much more complicated. That usually goes wrong for cases where the base character is narrow but some Emoji sequence is in play (so that the result is wide).
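
As a rough illustration of that failure mode, the sketch below sums a width per code point, the way a simple wcwidth()-style lookup would. The `codePointWidth` table is hypothetical and deliberately incomplete: it is a hand-picked handful of ranges, not the real East Asian Width data, and it is not how any particular terminal is implemented.

```swift
/// Deliberately incomplete stand-in for a per-code-point width table.
/// The ranges are a small hand-picked subset chosen for this example only.
func codePointWidth(_ scalar: Unicode.Scalar) -> Int {
    switch scalar.value {
    case 0x200D, 0xFE00...0xFE0F:
        return 0 // zero-width joiner and variation selectors
    case 0x1100...0x115F, 0x2E80...0xA4CF, 0xAC00...0xD7A3,
         0xFF00...0xFF60, 0x1F300...0x1F64F, 0x1F900...0x1F9FF:
        return 2 // a few ranges commonly classified as wide
    default:
        return 1 // everything else is treated as narrow
    }
}

/// Sums per-code-point widths, ignoring how scalars combine into sequences.
func naiveWidth(_ text: String) -> Int {
    text.unicodeScalars.reduce(0) { $0 + codePointWidth($1) }
}

// Narrow base character followed by the emoji variation selector.
let heart = "\u{2764}\u{FE0F}"
// Four wide emoji joined by zero-width joiners into one family emoji.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(naiveWidth(heart))  // 1
print(naiveWidth(family)) // 8
```

A renderer that supports these sequences typically draws each of them as one wide glyph, so neither per-code-point result lines up with what ends up on screen.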

Successfully merging this pull request may close these issues.

DiagnosticsFormatter crashes when source code contains an emoji