Skip to content

Conversation

@cmcaine
Copy link
Contributor

@cmcaine cmcaine commented Mar 26, 2023

The instructions specify that the only input letters will be the 26 english language letters, but we were testing with unlauts.

I think it's worth changing the test rather than the instructions because it's ambiguous whether adding diacritics to a letter makes it a new letter or not.

I did think about adding a bonus test that makes a decision one way or the other (you could use unicode's NFKD transformation1 to match letters, ignoring diacritics, or to require each unique glyph you could maybe use the NFC transformation + maybe splitting to graphemes (depending on if unicode has a codepoint for all "letters" in the alphabet)), but it's a bit complicated.

The instructions specify that the only input letters will be the 26
english language letters, but we were testing with unlauts.

I think it's worth changing the test rather than the instructions
because it's ambiguous whether adding diacritics to a letter makes it a
new letter or not.

I did think about adding a bonus test that makes a decision one way or
the other (you could use unicode's NFKD transformation[1] to match
letters, ignoring diacritics, or you could maybe use the NFC
transformation + maybe splitting to graphemes (depending on if unicode
has a codepoint for all "letters" in the alphabet)), but it's a bit
complicated.

[1]: https://unicode.org/reports/tr15/
@cmcaine cmcaine requested a review from a team as a code owner March 26, 2023 15:15
@SaschaMann
Copy link
Contributor

SaschaMann commented Mar 26, 2023

These aren't combined characters, and the instructions don't specify that the inputs are restricted to English letters. #488 was incorrect at the time and after #556 that hasn't changed, though #556 made it less explicit.

The previous text defined the alphabet for the pangrams to be "ASCII letters a to z, inclusive," not the inputs.

@cmcaine
Copy link
Contributor Author

cmcaine commented Mar 26, 2023

The instructions now say:

For this exercise we only use the basic letters used in the English alphabet: a to z.

Which doesn't specify that it is talking about only input or only output.

I'm okay with us revising the instructions instead of the tests, but I do think we should revise at least one of the two.

Re: combined characters, I'm referring to how u-with-unlaut can be one unicode codepoint or it can be two (u codepoint and the combining unlaut codepoint).

SaschaMann added a commit to exercism/problem-specifications that referenced this pull request Mar 26, 2023
Unfortunately #2215 introduced an ambiguity for some downstream implementations of this exercise that use non-ASCII inputs that shouldn't be considered part of the alphabet for the purpose of defining pangrams.

This PR is meant to clarify that only 'a':'z' are relevant to determine if a sentence is a pangram without restricting the inputs to those characters.

See also: exercism/julia#614
@SaschaMann
Copy link
Contributor

SaschaMann commented Mar 26, 2023

Re: combined characters, I'm referring to how u-with-unlaut can be one unicode codepoint or it can be two (u codepoint and the combining unlaut codepoint).

I know, I've ran into that issue ages ago on the rust track with some other exercise. But in this case, it's the single codepoint character. Or do you think it's a bad idea that this is visually ambigious?

I'm okay with us revising the instructions instead of the tests, but I do think we should revise at least one of the two.

I think that's a better option, I've opened exercism/problem-specifications#2240

@cmcaine
Copy link
Contributor Author

cmcaine commented Mar 27, 2023

Re: combined characters, I think normalisation is required to solve the exercise correctly for e.g. German or French sentences. Otherwise solutions will give different results depending on whether the glyph was written with combining diacritics or as a single codepoint, which I think is weird.

julia> "u\u0308" |> collect
2-element Vector{Char}:
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)

julia> using Unicode

julia> Unicode.normalize("u\u0308") |> collect
1-element Vector{Char}:
 'ü': Unicode U+00FC (category Ll: Letter, lowercase)

julia> occursin("ü", "u\u0308")
false

@cmcaine
Copy link
Contributor Author

cmcaine commented Mar 28, 2023

I've added a commit that changes the testcase to use an arabic language pangram so that we can avoid specifying any behaviour around latin characters with diacritcs.

Copy link
Contributor

@SaschaMann SaschaMann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nice solution to that issue

@SaschaMann SaschaMann merged commit f542e6c into main Mar 28, 2023
@SaschaMann SaschaMann deleted the fix-488 branch March 28, 2023 18:39
@SaschaMann
Copy link
Contributor

Thanks!

kytrinyx pushed a commit to exercism/problem-specifications that referenced this pull request Mar 29, 2023
* pangram: Clarify instructions

Unfortunately #2215 introduced an ambiguity for some downstream implementations of this exercise that use non-ASCII inputs that shouldn't be considered part of the alphabet for the purpose of defining pangrams.

This PR is meant to clarify that only 'a':'z' are relevant to determine if a sentence is a pangram without restricting the inputs to those characters.

See also: exercism/julia#614

* Update exercises/pangram/instructions.md

Co-authored-by: Colin Caine <[email protected]>

* Update exercises/pangram/instructions.md

Co-authored-by: Colin Caine <[email protected]>

---------

Co-authored-by: Colin Caine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants