Use character columns instead of UTF-8 columns in the diagnostics printer #2512

ahoppen · 2024-02-27T00:16:48Z

Fixes a crash and improves the printing of errors on lines that contain multi-byte UTF-8 characters.

Fixes #2507
rdar://123409748

ahoppen · 2024-02-27T00:18:45Z

@swift-ci Please test

bnbarham · 2024-02-27T00:24:29Z

Sources/SwiftDiagnostics/DiagnosticsFormatter.swift

+    /// For example the 👨‍👩‍👧‍👦 character is considered as a single character, not 25 bytes.
+    ///
+    /// Both the input and the output column are 1-based.
+    func characterColumn(ofUtf8Column utf8Column: Int) -> Int {


I assume you don't think this should be part of the location converter?

ahoppen · 2024-02-27

No, I think it’s quite specific to this use case.
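For readers following along, here is a minimal sketch of what such a conversion could look like. It is not the code from this PR: the free-function form and the extra `in line:` parameter are assumptions made so the example is self-contained.

```swift
/// Hypothetical sketch (not the PR's implementation): converts a 1-based
/// UTF-8 byte column into a 1-based column counted in Swift `Character`s
/// (extended grapheme clusters).
func characterColumn(ofUtf8Column utf8Column: Int, in line: String) -> Int {
    let targetOffset = utf8Column - 1  // 0-based byte offset we are looking for
    var bytesSeen = 0
    var column = 1
    for character in line {
        let bytesAfter = bytesSeen + character.utf8.count
        // Stop once the target byte lies inside the current character.
        if targetOffset < bytesAfter { break }
        bytesSeen = bytesAfter
        column += 1
    }
    return column
}

// The family emoji, spelled with scalar escapes so the bytes are unambiguous.
let familyEmoji = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
let sourceLine = "let \(familyEmoji) = ;"

// The `;` starts at UTF-8 column 33 but is the 9th character on the line.
print(characterColumn(ofUtf8Column: 33, in: sourceLine))  // 9
```

A column past the end of the line simply maps to one past the last character, so the sketch never indexes out of bounds.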

ahoppen · 2024-02-27T17:50:19Z

@swift-ci Please test Windows

Matejkob

I'm not sure that considering emojis as one character is correct. Aren't they two-character-wide symbols?

I believe in this test:

func testEmojiInSourceCode() {
    let source = """
      let 👨‍👩‍👧‍👦 = ;
      """

    let expectedOutput = """
      1 │ let 👨‍👩‍👧‍👦 = ;
        │         ╰─ error: expected expression in variable
      """

    assertStringsEqualWithDiff(annotate(source: source), expectedOutput)
}

the diagnostic should point to ; and not to the space between = and ;. I cannot test it right now, but I think that if we change the variable name to regular characters, like so:

func testEmojiInSourceCode() {
    let source = """
      let e = ;
      """

    let expectedOutput = """
      1 │ let e = ;
        │         ╰─ error: expected expression in variable
      """

    assertStringsEqualWithDiff(annotate(source: source), expectedOutput)
}

the annotation will point to the semicolon.

ahoppen · 2024-02-27T19:34:09Z

The problem is that, as far as I know, it depends on the font that renders the emoji. A font can render an emoji as a single-width character, a double-width character, anything in between, or however it likes, really. As far as I know there’s no Unicode specification for it, and my thought was that assuming that 👨‍👩‍👧‍👦 is rendered as a single character is closer to the truth than assuming that it gets rendered as 25 characters.
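To make the numbers concrete, the sketch below prints how long the family emoji is in each of Swift's string views. The variable name is made up for illustration, and the single-`Character` result assumes a Swift toolchain with current grapheme-breaking rules.

```swift
// Man, woman, girl, boy joined by zero-width joiners (U+200D).
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

print(family.count)                 // 1  extended grapheme cluster (Character)
print(family.unicodeScalars.count)  // 7  four emoji scalars + three joiners
print(family.utf16.count)           // 11 four surrogate pairs + three joiners
print(family.utf8.count)            // 25 four 4-byte scalars + three 3-byte joiners
```

None of these counts says how many terminal columns the glyph occupies; that is decided by whatever renders it.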

Matejkob · 2024-02-27T20:12:21Z

I ran the emoji test, and indeed, the annotation shifts depending on the Xcode font (and also in the terminal), which highlights the complexity of emoji rendering in code.

Your explanation about the unpredictable rendering of emojis due to font differences is super insightful, thanks!

Out of curiosity, I've built this C code:

#include <stdio.h>

int main() {
    printf("Hello, World!🤯")
    return 0;
}

and the annotation in the diagnostic message produced by gcc is shifted as well.

al45tair · 2024-04-08T11:40:12Z

The problem is that, as far as I know, it depends on the font that renders the emoji. A font can render an emoji as a single-width character, a double-width character, anything in between, or however it likes, really. As far as I know there’s no Unicode specification for it, and my thought was that assuming that 👨‍👩‍👧‍👦 is rendered as a single character is closer to the truth than assuming that it gets rendered as 25 characters.

This is mainly to do with terminal emulator behaviour rather than the font (though the font can certainly make things worse if it so desires); generally speaking terminal emulators try to match the behaviour of the wcwidth() function on their respective platform, and if that supports Unicode, it tends to just use the narrow or wide property from the East Asian Width table without doing anything much more complicated. That usually goes wrong for cases where the base character is narrow but some Emoji sequence is in play (so that the result is wide).
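As a rough illustration of that failure mode, the sketch below sums a per-scalar width in the style of wcwidth(). The width table is a tiny hand-picked stand-in and not the real East Asian Width data; the function names are hypothetical, and actual wcwidth() results vary by platform and locale.

```swift
/// Toy per-scalar width. The ranges are illustrative assumptions only.
func toyScalarWidth(_ scalar: Unicode.Scalar) -> Int {
    switch scalar.value {
    case 0x200D, 0xFE00...0xFE0F:
        return 0  // zero-width joiner and variation selectors
    case 0x1100...0x115F, 0x4E00...0x9FFF, 0x1F300...0x1F64F:
        return 2  // a few ranges treated as wide in this toy table
    default:
        return 1  // everything else counts as narrow here
    }
}

/// Sums the toy width over every scalar, ignoring grapheme clusters.
func toyWidth(_ text: String) -> Int {
    text.unicodeScalars.reduce(0) { $0 + toyScalarWidth($1) }
}

// U+2764 is narrow on its own; U+FE0F requests emoji presentation.
let redHeart = "\u{2764}\u{FE0F}"
print(redHeart.count)      // 1 Character
print(toyWidth(redHeart))  // 1, although emoji presentation is usually drawn wide

// A ZWJ sequence: one glyph when supported, but the per-scalar sum is 8.
let familySequence = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
print(toyWidth(familySequence))  // 8
```

Neither the per-scalar sum nor the `Character` count is guaranteed to match what a given terminal draws, which is why the annotation can still drift.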

ahoppen requested a review from bnbarham as a code owner February 27, 2024 00:16

Use character columns instead of UTF-8 columns in the diagnostics printer

7676255

Fixes a crash and improves the printing of errors on lines that contain multi-byte UTF-8 characters. Fixes swiftlang#2507 rdar://123409748

ahoppen force-pushed the ahoppen/character-column branch from 15f11e8 to 7676255 Compare February 27, 2024 00:18

bnbarham approved these changes Feb 27, 2024

Matejkob reviewed Feb 27, 2024

ahoppen merged commit d36f0c1 into swiftlang:main Feb 28, 2024

ahoppen deleted the ahoppen/character-column branch February 28, 2024 03:21

Matejkob mentioned this pull request Apr 5, 2024

Diagnostic formatter doesn't account for grapheme length in its column annotations #1375

Closed
