Fix #488 use non-ascii punctuation, not letters #614

cmcaine · 2023-03-26T15:15:27Z

The instructions specify that the only input letters will be the 26 English-language letters, but we were testing with umlauts.

I think it's worth changing the test rather than the instructions because it's ambiguous whether adding diacritics to a letter makes it a new letter or not.

I did think about adding a bonus test that makes a decision one way or the other (you could use Unicode's NFKD transformation[1] to match letters, ignoring diacritics, or, to require each unique glyph, you could maybe use the NFC transformation plus maybe splitting into graphemes (depending on whether Unicode has a codepoint for all "letters" in the alphabet)), but it's a bit complicated.

[1]: https://unicode.org/reports/tr15/

SaschaMann · 2023-03-26T17:08:43Z

These aren't combined characters, and the instructions don't specify that the inputs are restricted to English letters. #488 was incorrect at the time and after #556 that hasn't changed, though #556 made it less explicit.

The previous text defined the alphabet for the pangrams to be "ASCII letters a to z, inclusive," not the inputs.

cmcaine · 2023-03-26T18:18:08Z

The instructions now say:

> For this exercise we only use the basic letters used in the English alphabet: a to z.

That doesn't specify whether it is talking about only the input or only the output.

I'm okay with us revising the instructions instead of the tests, but I do think we should revise at least one of the two.

Re: combined characters, I'm referring to how u-with-umlaut can be one Unicode codepoint or two (the u codepoint and the combining umlaut codepoint).

SaschaMann · 2023-03-26T19:18:20Z

> Re: combined characters, I'm referring to how u-with-umlaut can be one Unicode codepoint or two (the u codepoint and the combining umlaut codepoint).

I know, I ran into that issue ages ago on the Rust track with some other exercise. But in this case, it's the single-codepoint character. Or do you think it's a bad idea that this is visually ambiguous?

> I'm okay with us revising the instructions instead of the tests, but I do think we should revise at least one of the two.

I think that's the better option. I've opened exercism/problem-specifications#2240

cmcaine · 2023-03-27T10:46:20Z

Re: combined characters, I think normalisation is required to solve the exercise correctly for e.g. German or French sentences. Otherwise solutions will give different results depending on whether the glyph was written with combining diacritics or as a single codepoint, which I think is weird.

julia> "u\u0308" |> collect
2-element Vector{Char}:
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)

julia> using Unicode

julia> Unicode.normalize("u\u0308") |> collect
1-element Vector{Char}:
 'ü': Unicode U+00FC (category Ll: Letter, lowercase)

julia> occursin("ü", "u\u0308")
false
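
The mismatch above goes away if both strings are normalised to the same form before comparing. A minimal sketch, assuming NFC is the form you want; `contains_normalized` is a hypothetical helper name, not something from the exercise or its tests:

```julia
using Unicode

# Hypothetical helper: normalise both sides to NFC before searching,
# so the precomposed and combining spellings compare as equal.
contains_normalized(needle, haystack) =
    occursin(Unicode.normalize(needle, :NFC), Unicode.normalize(haystack, :NFC))

decomposed = "u\u0308"   # 'u' followed by the combining diaeresis U+0308
precomposed = "\u00fc"   # the single codepoint 'ü'

occursin(precomposed, decomposed)             # false, as in the session above
contains_normalized(precomposed, decomposed)  # true
```

Whether a solution should normalise at all is exactly the decision this PR tries to avoid forcing on students.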

cmcaine · 2023-03-28T16:42:11Z

I've added a commit that changes the test case to use an Arabic-language pangram so that we can avoid specifying any behaviour around Latin characters with diacritics.
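
For illustration, here is a sketch of a solution for which such a test case is unambiguous: only 'a':'z' count towards the pangram, and any other character is ignored. The function name `ispangram` and the sample strings are assumptions made for this example; they are not the actual test inputs from the commit.

```julia
# Sketch only: a sentence is a pangram if every letter in 'a':'z' occurs
# in it, case-insensitively. Characters outside that range are ignored.
function ispangram(sentence)
    seen = Set(lowercase(sentence))       # set of characters present
    return all(c -> c in seen, 'a':'z')   # every ASCII letter must appear
end

ispangram("The quick brown fox jumps over the lazy dog")        # true
ispangram("مرحبا")                                              # false: no a-z letters at all
ispangram("The quick brown fox jumps over the lazy dog مرحبا")  # true: extra characters are ignored
```

Because Arabic letters have no relationship to 'a':'z' under any normalisation form, this check gives the same answer whether or not a solution normalises its input.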

SaschaMann

That's a nice solution to that issue

SaschaMann · 2023-03-28T18:39:06Z

Thanks!

* pangram: Clarify instructions

  Unfortunately #2215 introduced an ambiguity for some downstream implementations of this exercise that use non-ASCII inputs that shouldn't be considered part of the alphabet for the purpose of defining pangrams. This PR is meant to clarify that only 'a':'z' are relevant to determine if a sentence is a pangram without restricting the inputs to those characters.

  See also: exercism/julia#614

* Update exercises/pangram/instructions.md

  Co-authored-by: Colin Caine <[email protected]>

* Update exercises/pangram/instructions.md

  Co-authored-by: Colin Caine <[email protected]>

---------

Co-authored-by: Colin Caine <[email protected]>

cmcaine requested a review from a team as a code owner March 26, 2023 15:15

SaschaMann mentioned this pull request Mar 26, 2023

pangram: Clarify instructions exercism/problem-specifications#2240

Merged

Avoid diacritics ambiguities in Unicode test case

e0f7db9

Pangram found here: https://clagnut.com/blog/2380#Arabic

SaschaMann approved these changes Mar 28, 2023

SaschaMann merged commit f542e6c into main Mar 28, 2023

SaschaMann deleted the fix-488 branch March 28, 2023 18:39
